It's not really suited to CUDA/OpenCL; for the purposes of this discussion, we c...

darkxanthos · on Dec 26, 2010

While extremely limited, I thought shared memory enabled inter-block communication? Note that I have no application for this; I'm just researching.

tmurray · on Dec 26, 2010

No, shared memory is only for intra-block communication. __syncthreads() only guarantees that every thread in a block has reached a particular point; it does not synchronize the blocks in a grid with each other.
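To make the block scope concrete, here is a minimal CUDA sketch of a per-block sum. The names `blockSum`, `sumOnDevice` and `kThreads` are hypothetical, chosen for illustration, and the sketch assumes a non-empty input and omits error checking. Each block reduces its own slice in shared memory, with `__syncthreads()` acting as a barrier for that block alone; the per-block results are then combined on the host, because nothing inside the kernel lets one block wait for another.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;  // threads per block (illustrative; must be a power of two)

// Each block reduces its own slice of `in` into a single partial sum.
__global__ void blockSum(const float* in, float* partial, int n) {
    __shared__ float tile[kThreads];           // visible only to threads of this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                           // barrier for this block's threads only

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = tile[0];
}

// Hypothetical host helper: sums `host` (assumed non-empty) using one partial per block.
float sumOnDevice(const std::vector<float>& host) {
    int n = static_cast<int>(host.size());
    int blocks = (n + kThreads - 1) / kThreads;

    float* dIn = nullptr;
    float* dPartial = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMemcpy(dIn, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, kThreads>>>(dIn, dPartial, n);

    // The blocking copy waits for the kernel; kernel completion is the
    // only point at which all blocks are known to have finished.
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);

    float total = 0.0f;
    for (float p : partial) total += p;        // cross-block step done outside the kernel
    return total;
}
```

The final combination could also be done with a second kernel launch or with atomics on global memory; the host loop is used here only to keep the sketch short.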

Take Fermi, for example: you can potentially have 128 blocks running concurrently (8 blocks per multiprocessor, 16 multiprocessors on GF110), but you can launch a grid of 65535x65535 blocks in a single kernel. Most of those blocks are not resident at any given moment, so if you try to do arbitrary global synchronization, you'd have a state explosion (PDF: http://www.gdiamos.net/papers/stateExplosion.pdf). The best way to solve a problem with significant interaction between data elements is to use a persistent work queue (as described in PDF: http://www.tml.tkk.fi/~timo/publications/aila2009hpg_paper.p...).
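A rough sketch of the persistent work queue idea, under stated assumptions: this is not the code from the linked paper, and `nextItem`, `persistentSquare` and `squareWithWorkQueue` are hypothetical names. Squaring integers stands in for real work, and error checking is omitted. Instead of launching one block per work item, the grid is sized to what the device can keep resident, and each thread repeatedly claims the next item from a global counter with `atomicAdd` until the queue is empty.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 256;        // threads per block (illustrative choice)

// Head of the work queue: index of the next unclaimed item.
__device__ unsigned int nextItem;

// Persistent kernel: every thread loops, claiming items until none remain.
__global__ void persistentSquare(const int* in, int* out, unsigned int n) {
    while (true) {
        unsigned int i = atomicAdd(&nextItem, 1u);  // claim one item atomically
        if (i >= n) return;                         // queue exhausted
        out[i] = in[i] * in[i];                     // placeholder for real work
    }
}

// Hypothetical host helper: squares every element using the work queue.
std::vector<int> squareWithWorkQueue(const std::vector<int>& host) {
    unsigned int n = static_cast<unsigned int>(host.size());
    int* dIn = nullptr;
    int* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMemcpy(dIn, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    unsigned int zero = 0;
    cudaMemcpyToSymbol(nextItem, &zero, sizeof(zero));  // reset the queue head

    // Launch only as many blocks as can be resident at once, so the grid
    // size follows the hardware rather than the amount of work.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, persistentSquare,
                                                  kThreads, 0);
    int blocks = prop.multiProcessorCount * blocksPerSM;

    persistentSquare<<<blocks, kThreads>>>(dIn, dOut, n);

    std::vector<int> result(n);
    cudaMemcpy(result.data(), dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Here the items are independent, so a single counter is enough. A queue in which processing one item produces new items would also need an atomically updated tail and a termination condition, which this sketch leaves out.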