Huh, that’s weird, I run a C++ compiler directly on my GPU code. At the function level, the only difference between CPU and GPU code is whether I tag it with a qualifier like __global__ (for kernel entry points) or __host__ __device__ (for functions shared by both sides), and lots of functions compile and run on both CPU and GPU.
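To make that concrete, here is a minimal sketch of one function shared between CPU and GPU builds. The HOST_DEVICE macro and the names saxpy_element, saxpy_cpu, and saxpy_gpu are hypothetical, made up for illustration. It assumes nvcc defines __CUDACC__ when compiling CUDA code, so the same file also builds with a plain C++ compiler.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper macro (not from the original comment): expands to the
// CUDA qualifiers under nvcc and to nothing under a plain C++ compiler.
#ifdef __CUDACC__
#define HOST_DEVICE __host__ __device__
#else
#define HOST_DEVICE
#endif

// Ordinary C++ function. The body is identical for CPU and GPU builds.
HOST_DEVICE inline float saxpy_element(float a, float x, float y) {
    return a * x + y;
}

// CPU path: a plain loop over the shared function.
// Assumes x has at least as many elements as y.
inline void saxpy_cpu(float a, const std::vector<float>& x, std::vector<float>& y) {
    for (std::size_t i = 0; i < y.size(); ++i) {
        y[i] = saxpy_element(a, x[i], y[i]);
    }
}

#ifdef __CUDACC__
// GPU path: kernel entry point, one thread per element.
// x and y must point to device memory holding at least n floats.
__global__ void saxpy_gpu(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = saxpy_element(a, x[i], y[i]);
    }
}
#endif
```

The arithmetic lives in one place. What changes between the two paths is how the work is spread over threads, which is the device's programming model rather than the language.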
Memory layout, thread scheduling, and barriers are not features of the C language and have nothing to do with whether your C is “normal”. Those are part of the programming model of the device you’re using, and apply to all languages on that device. Normal C on an Arduino looks different than normal C on an Intel CPU which looks different than normal C on an NVIDIA GeForce.
You can look at C++ AMP too: it runs on all GPUs that support DX11 on Windows, and is a part of the Windows SDK. It has also been implemented in AMD's ROCm stack on Linux, which additionally provides HIP, a CUDA-like API.
Normal C/C++ can run fine on modern GPU architectures.