No Intel either. The port would be easy: gpuintrin.h abstracts over the intrinsics, so you provide an implementation of those and, if you want to run the test suite, write a loader in terms of OpenCL or whatever.
The protocol needs ordered load/store on shared memory but nothing else. I wrote a paper trying to make it clear that load/store on shmem is sufficient, which doesn't seem to have been found persuasive. It's specifically designed to tolerate architectures doing sloppy things with cache invalidation. It could run much faster with fetch_or / fetch_and instructions (as APUs have, but PCIe does not). It could also hang off DMA, but that isn't implemented (I want to have the GPU push packets over the network without involving the x64 CPU at all).
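To make the "load/store is enough" claim concrete, here is a rough sketch of a one-slot mailbox where each flag has exactly one writer, so no read-modify-write is ever needed. It is not the real implementation: `Mailbox`, `client_try_send`, `server_try_serve`, `client_try_recv` and `round_trip` are made-up names, the "work" is a placeholder, and C++ acquire/release atomics stand in for whatever ordered load/store the target actually provides.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical one-slot mailbox. Ownership of `payload` is encoded by the flags:
// inbox == outbox -> client owns it, inbox != outbox -> server owns it.
struct Mailbox {
  std::atomic<uint32_t> outbox{0}; // written only by the client
  std::atomic<uint32_t> inbox{0};  // written only by the server
  uint64_t payload = 0;            // plain memory, guarded by the flags
};

// Client: post a request if the slot is idle. Uses only loads and one store.
bool client_try_send(Mailbox &m, uint64_t arg) {
  uint32_t out = m.outbox.load(std::memory_order_relaxed); // our own flag
  if (m.inbox.load(std::memory_order_acquire) != out)
    return false; // server still owns the slot
  m.payload = arg;
  m.outbox.store(out ^ 1u, std::memory_order_release); // hand over to server
  return true;
}

// Server: handle one pending request, if any. Uses only loads and one store.
bool server_try_serve(Mailbox &m) {
  uint32_t in = m.inbox.load(std::memory_order_relaxed); // our own flag
  if (m.outbox.load(std::memory_order_acquire) == in)
    return false; // nothing pending
  m.payload = m.payload * 2 + 1; // placeholder for the real work
  m.inbox.store(in ^ 1u, std::memory_order_release); // hand back to client
  return true;
}

// Client: collect the reply once the server has handed the slot back.
// The caller is assumed to call this only after a successful send.
bool client_try_recv(Mailbox &m, uint64_t &result) {
  uint32_t out = m.outbox.load(std::memory_order_relaxed);
  if (m.inbox.load(std::memory_order_acquire) != out)
    return false; // reply not ready yet
  result = m.payload;
  return true;
}

// Single-threaded walk through one request/reply cycle, for illustration.
uint64_t round_trip(Mailbox &m, uint64_t arg) {
  bool sent = client_try_send(m, arg);
  assert(sent && "slot must be idle before sending");
  bool served = server_try_serve(m);
  assert(served && "server should see the pending request");
  uint64_t result = 0;
  bool received = client_try_recv(m, result);
  assert(received && "reply should be visible after the server's store");
  (void)sent; (void)served; (void)received;
  return result;
}
```

In a real setup the two sides would be a GPU and a host polling the same memory, and the flag flip is where fetch_or / fetch_and would help if the interconnect had them.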