Convenient authoring doesn't necessarily make it a good fit for the hardware. Add in enough divergence and your GPU code is going to be matched or outperformed by a competent CPU implementation (on a chip of comparable size). Branchless code can result in substantial speedups on either.
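
For a concrete picture of what "branchless" means here, a minimal sketch in C follows. It sums the elements above a threshold, once with a data-dependent `if` and once with a mask built from the comparison. The function names are made up for illustration, and whether the second form is actually faster depends on the compiler and the data: an optimizer may already turn the first loop into a conditional move or vectorize it.

```c
#include <stddef.h>
#include <stdint.h>

/* Branchy version: whether the add happens depends on the data, so the
 * compiler may emit a conditional jump that is hard to predict on
 * random input (and that diverges across lanes on a GPU). */
int64_t sum_above_branchy(const int32_t *v, size_t n, int32_t t) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (v[i] > t) {
            sum += v[i];
        }
    }
    return sum;
}

/* Branchless version: the comparison yields 0 or 1, which is negated
 * into an all-zeros or all-ones mask. Every iteration does the same
 * work; the mask decides whether the element contributes. */
int64_t sum_above_branchless(const int32_t *v, size_t n, int32_t t) {
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int64_t mask = -(int64_t)(v[i] > t); /* -1 if true, 0 if false */
        sum += (int64_t)v[i] & mask;
    }
    return sum;
}
```

Both functions return the same result for any input; only the control flow differs.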

To be fair, though, modern GPUs are pretty good at branching and latency hiding, while numpy-style code has poor data locality unless you have a magic compiler that fuses the whole-array operations into a single pass.
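
The locality point can be sketched in C as well. Assume an expression like `out = a * b + c` over arrays: evaluated one whole-array operation at a time, it writes a temporary and then reads it back in a second pass, while a fused loop touches each element once. The function names and the caller-supplied `tmp` buffer are illustrative assumptions, not how any particular array library is implemented.

```c
#include <stddef.h>

/* Array-at-a-time style: each operation is a full pass over memory,
 * and the intermediate product is written out to tmp before being
 * read back. For large n, tmp has left the cache by the second pass. */
void mul_add_unfused(const double *a, const double *b, const double *c,
                     double *tmp, double *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        tmp[i] = a[i] * b[i];      /* pass 1: tmp = a * b */
    }
    for (size_t i = 0; i < n; i++) {
        out[i] = tmp[i] + c[i];    /* pass 2: out = tmp + c */
    }
}

/* Fused style: one pass, no temporary array; the intermediate product
 * can stay in a register. */
void mul_add_fused(const double *a, const double *b, const double *c,
                   double *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] * b[i] + c[i];
    }
}
```

The fused loop is what a fusing compiler would ideally produce from the array-at-a-time form; without one, you write it by hand.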