When I write CUDA kernels, I optimize very specifically for NVIDIA GPU features:
* I have 1024 threads per block
* I have 48KB of shared memory per block
* I have 32 threads per warp and need to make sure that global-to-shared memory reads are coalesced.
* Think SIMD and avoid branching as much as possible
My kernels usually follow a specific pattern:
(1) Read global memory into shared memory, making sure that if thread i reads memory[n], then thread i+1 reads memory[n+1].
(2) __syncthreads().
(3) Do the computation in the most thread-balanced way possible (sketched below).
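Here is a minimal sketch of that pattern. The kernel name, the TILE constant, and the toy three-point sum are placeholders of mine, not from any real codebase:

    #define TILE 1024  // one element per thread at 1024 threads/block (4KB of the 48KB shared limit)

    // Launch: three_point_sum<<<(n + TILE - 1) / TILE, TILE>>>(in, out, n);
    __global__ void three_point_sum(const float* in, float* out, int n)
    {
        __shared__ float tile[TILE];
        int gid = blockIdx.x * TILE + threadIdx.x;

        // (1) Coalesced load: thread i reads in[base+i] and thread i+1 reads
        // in[base+i+1], so each warp's 32 loads collapse into one transaction.
        tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;

        // (2) Wait until the whole tile is visible to every thread in the block.
        __syncthreads();

        // (3) Every thread does identical work from shared memory; the clamped
        // indices avoid divergent branches at the tile edges.
        int l = (threadIdx.x == 0)        ? 0        : threadIdx.x - 1;
        int r = (threadIdx.x == TILE - 1) ? TILE - 1 : threadIdx.x + 1;
        if (gid < n) out[gid] = tile[l] + tile[threadIdx.x] + tile[r];
    }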
This very specific pattern doesn't really work elsewhere. In fact, optimizing in this fashion and then porting to C++ or another platform loses the specific optimization, while programming in a more general way loses all the things that make the program fast. Anyway, I'm definitely going to look this over more.
AMD's "HCC" is C++ with GPU features added on top. Unlike CUDA, which extends the language itself, HCC implements all the relevant features in C++ templates alone.
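For flavor, a GPU vector add in HCC's "HC" API looks roughly like this (adapted from memory of the ROCm hcc samples, so treat the exact names as approximate); the dispatch is an ordinary templated function call, with no special launch syntax:

    #include <hc.hpp>
    #include <vector>

    int main() {
        std::vector<int> a(1024, 1), b(1024, 2), c(1024, 0);
        hc::array_view<int, 1> av_a(1024, a), av_b(1024, b), av_c(1024, c);

        // The "kernel launch" is just a template function taking a lambda;
        // the [[hc]] attribute marks the lambda as GPU code.
        hc::parallel_for_each(av_c.get_extent(), [=](hc::index<1> i) [[hc]] {
            av_c[i] = av_a[i] + av_b[i];
        });

        av_c.synchronize();  // copy results back to the host vector
        return 0;
    }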
If you simply use 64 threads per "warp" (AMD schedules in 64-wide wavefronts) and 32KB of LDS (AMD's name for shared memory) per block, you can write high-performance code that is portable between AMD and NVIDIA GPUs.
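A sketch of what that lowest common denominator looks like in practice (the constants and the toy reduction are my own choices): a block size that is a multiple of 64 divides evenly into both NVIDIA's 32-wide warps and AMD's 64-wide wavefronts, and the shared-memory footprint stays under AMD's 32KB LDS limit (and therefore NVIDIA's 48KB limit too):

    #define BLOCK 256  // multiple of 64 and <= 1024: valid on both vendors

    __global__ void block_sum(const float* in, float* out, int n)
    {
        __shared__ float tile[BLOCK];  // 1KB, far below the 32KB LDS budget
        int gid = blockIdx.x * BLOCK + threadIdx.x;
        tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
        __syncthreads();

        // Tree reduction synchronized only with __syncthreads(), never with
        // warp-level tricks, so it is correct for 32- and 64-wide SIMD alike.
        for (int stride = BLOCK / 2; stride > 0; stride /= 2) {
            if (threadIdx.x < stride)
                tile[threadIdx.x] += tile[threadIdx.x + stride];
            __syncthreads();
        }
        if (threadIdx.x == 0) out[blockIdx.x] = tile[0];
    }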
AMD seems like it's a bit behind with regard to GPGPU adoption. But AMD's hardware tends to be a good bit cheaper: you can get HBM2 models at ~$2000 from AMD (the Radeon Pro WX 9100, which is Vega architecture).
Although... as they say... hardware is cheap. I'm sure most datacenters will prefer the $8000 NVIDIA V100 instead, because more people are using that hardware. In particular, it's easier to get started with a V100 thanks to AWS and other cloud-compute offerings.
There is no OpenCL in the story there. AMD created HIP as a direct copy of CUDA. It then gets compiled to work on AMD hardware, or can be translated to CUDA.
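To show what "direct copy" means in practice, here is a minimal HIP vector add (my own toy example); apart from the cuda-to-hip renames it is character-for-character the CUDA program, and hipcc even accepts the <<<>>> launch syntax:

    #include <hip/hip_runtime.h>

    __global__ void vadd(const float* a, const float* b, float* c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        float *a, *b, *c;
        hipMalloc(&a, n * sizeof(float));  // cudaMalloc in the CUDA version
        hipMalloc(&b, n * sizeof(float));
        hipMalloc(&c, n * sizeof(float));
        vadd<<<(n + 255) / 256, 256>>>(a, b, c, n);  // same launch syntax
        hipDeviceSynchronize();            // cudaDeviceSynchronize
        hipFree(a); hipFree(b); hipFree(c);
        return 0;
    }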
AMD also supports OpenCL (which I prefer to both CUDA and HIP), but it's not connected to HIP.
The fact that HIP exists is why we chose CUDA for our program 8 years ago. OpenCL and CUDA were the choices. I thought about how I would design such a GPU language and then looked at CUDA and OpenCL; what I dreamt up matched CUDA exactly. I foresaw that one day it would no longer matter what you wrote in CUDA, because it could be easily translated to any type of GPU hardware. In the past few years, the Portland Group made a CUDA-to-x86 compiler, and now there's HIP.
Google won out over Oracle in the end, but it took a long time (from 2012 until 2016, four years of court cases), with some courts finding that API structures were copyrightable and others finding that they were not, or that reimplementation was fair use. I guess we now have a precedent thanks to this, but it could still be an issue? IANAL, so I don't know.
Somebody speculated that Oracle used the lawsuit to establish firm boundaries around what can be considered copyright infringement, with the intent to implement Amazon/GCE APIs for drop-in compatibility with their own cloud offering. So they tried all kinds of ridiculous stuff to see what works and what doesn't, in order to properly cover their backs while "stealing" other APIs.
Given that Google helped sink Sun by ripping them off, and that Sun wasn't in a position to pay for lawyers to apply the same medicine they had applied to Microsoft....
AMD is working on a cuDNN compatibility layer IIRC (MIOpen), and its ROCm group has created a CUDA transpiler (to intermediary HIP, then to AMD binary via HCC).
Edit: OK, I was reading the docs and I think I got it: hipDNN is a wrapper. Once finished, a tool will search and replace calls (from cuDNN to hipDNN), and hipDNN itself will in turn call MIOpen. Not sure if that's right; I would appreciate it if someone who knows more could confirm.
Yup! I spent one very LONG weekend trying to get TensorFlow running on my AMD GPU. Managed to finally get it running, but it was slower than using my CPU....
The combination of TensorFlow and Codeplay's ComputeCpp might already work. At least they have added SYCL support to the linear algebra library (Eigen).
I think they also have some TensorFlow-specific SYCL kernels; I'm not sure it all goes through Eigen. Plus they're at TF 1.6, whereas AMD's HIP version seems to be stuck on TF 1.0. AMD's method of maintaining a pure fork and running hipify on every upstream commit doesn't seem very sustainable...
From Codeplay's blogs, it seems they ported the tensor part of Eigen to SYCL. The matrix part is not supported under SYCL.
TensorFlow does have an option to compile models to custom code (XLA); that part might not use Eigen.
"HIP is not intended to be a drop-in replacement for CUDA, and developers should expect to do some manual coding and performance tuning work to complete the port."
It's a great start, but I'm sure a lot of cases are not yet handled (like inline asm instructions?).
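For example (my own illustration), inline PTX is exactly the kind of thing a mechanical port can't translate; the AMD side needs a hand-written equivalent behind a platform check (the macro name has varied across HIP versions):

    // NVIDIA-only: reads the %laneid special register via inline PTX.
    // No transpiler can turn this into AMD GCN automatically.
    __device__ int lane_id() {
    #ifdef __HIP_PLATFORM_HCC__
        return threadIdx.x % 64;   // hand-ported fallback for 64-wide wavefronts
    #else
        int lane;
        asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
        return lane;
    #endif
    }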