
It's not really suited to CUDA/OpenCL; for the purposes of this discussion, we can treat the two as the same because the execution model of actual kernels is almost identical. The problems are:

- the size of the grid is fixed at kernel launch time, so you can't arbitrarily spawn more work

- the completion of one block cannot depend on the execution of another block, which means inter-block communication within a kernel is forbidden

So you could do some sort of block-local actor operation then do an exchange of the halo cells between blocks in another kernel (the only global synchronization point allowed in CUDA/OpenCL), but that seems really painful, especially if actors are sparse within your grid. A work queue approach would probably work better; you'd certainly get better utilization and load balancing than trying to spatially partition a large grid.
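To make the "block-local step, then exchange halos in another kernel" idea concrete, here is a rough sketch under my own assumptions, not anything from a real actor system. It uses a 1D three-point average as a stand-in for the actor update; `stepKernel`, `runSteps`, and `kTile` are made-up names, the boundary is treated as zero, and error checking is left out. The halo "exchange" is simply each block re-reading its neighbours' edge cells from global memory at the start of the next launch.

```cuda
#include <cuda_runtime.h>
#include <utility>
#include <vector>

constexpr int kTile = 256;  // threads per block (hypothetical tile size)

// One step of a 1D three-point average. Each block works on its own tile in
// shared memory. Halo cells come from global memory, which holds whatever the
// previous kernel launch wrote.
__global__ void stepKernel(const float* in, float* out, int n) {
    __shared__ float tile[kTile + 2];  // tile plus one halo cell on each side
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;

    if (gid < n) tile[lid] = in[gid];
    // Left halo: last cell of the previous block's tile (zero at the boundary).
    if (threadIdx.x == 0)
        tile[0] = (gid > 0 && gid <= n) ? in[gid - 1] : 0.0f;
    // Right halo: first cell of the next block's tile (zero at the boundary).
    if (threadIdx.x == blockDim.x - 1 || gid == n - 1)
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();  // block-local barrier: the tile is now fully loaded

    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}

// Runs `steps` launches. Launches on the same stream execute in order, so the
// boundary between two launches acts as the grid-wide synchronization point.
std::vector<float> runSteps(std::vector<float> host, int steps) {
    int n = static_cast<int>(host.size());
    if (n == 0) return host;
    float* a = nullptr;
    float* b = nullptr;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));
    cudaMemcpy(a, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int blocks = (n + kTile - 1) / kTile;
    for (int s = 0; s < steps; ++s) {
        stepKernel<<<blocks, kTile>>>(a, b, n);
        std::swap(a, b);  // this step's output is the next step's input
    }
    cudaMemcpy(host.data(), a, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(a);
    cudaFree(b);
    return host;
}
```

Every cell gets a thread on every step whether or not anything lives there, which is exactly the utilization problem with sparse actors.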




While extremely limited, I thought shared memory enabled interblock communications...? It's important to note I have no applications for this, just researching.


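No, shared memory is only for intra-block communication; each block gets its own copy, and other blocks can't see it. __syncthreads() is a barrier for the threads of a single block: it guarantees that every thread in that block has reached a particular point, not that every block in the grid has.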
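A small illustration of that scope, as a sketch only: a sum where each block reduces its own slice in shared memory, and the per-block results are combined only after the kernel has finished. `blockSum`, `sumOnGpu`, and `kBlock` are names I made up for this example, `kBlock` is assumed to be a power of two, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlock = 256;  // threads per block; assumed to be a power of two

// Each block sums its own slice of the input. `scratch` is private to the
// block, and __syncthreads() only waits for the threads of this block.
__global__ void blockSum(const int* in, int* partial, int n) {
    __shared__ int scratch[kBlock];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    scratch[threadIdx.x] = (gid < n) ? in[gid] : 0;
    __syncthreads();

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            scratch[threadIdx.x] += scratch[threadIdx.x + stride];
        __syncthreads();
    }
    // One result per block goes out through global memory.
    if (threadIdx.x == 0) partial[blockIdx.x] = scratch[0];
}

// Combining across blocks happens after the kernel has completed; here the
// host does it, but a second kernel launch would work the same way.
long long sumOnGpu(const std::vector<int>& host) {
    int n = static_cast<int>(host.size());
    if (n == 0) return 0;
    int blocks = (n + kBlock - 1) / kBlock;

    int* dIn = nullptr;
    int* dPartial = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dPartial, blocks * sizeof(int));
    cudaMemcpy(dIn, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    blockSum<<<blocks, kBlock>>>(dIn, dPartial, n);

    std::vector<int> partial(blocks);
    // This blocking copy waits for the kernel to finish first.
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);

    long long total = 0;
    for (int p : partial) total += p;
    return total;
}
```

Nothing inside `blockSum` ever reads another block's `scratch`; the only cross-block data path is global memory, and it is only safe to read once the launch is done.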

Take Fermi, for example--you can potentially have 128 blocks running concurrently (8 blocks per multiprocessor, 16 multiprocessors on GF110), but you can launch a grid of 65535x65535 blocks in a single kernel. As a result, if you try to do arbitrary global synchronization, you'd have a state explosion (PDF: http://www.gdiamos.net/papers/stateExplosion.pdf ). The best way to solve a problem with significant interaction between data elements is to use a persistent work queue (as described in PDF: http://www.tml.tkk.fi/~timo/publications/aila2009hpg_paper.p... ).
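Roughly what the pull side of a persistent work queue looks like, as a simplified sketch and not the scheme from the linked paper: launch only as many blocks as can be resident at once (16 x 8 = 128 on the GF110 numbers above), and have every thread loop, claiming the next item with an atomic counter. `persistentWorker`, `processAll`, and `kThreads` are hypothetical names, the "work" is just squaring an integer, the queue is pre-filled so nothing spawns new work, and error checking is omitted. The occupancy query used here comes from CUDA toolkits newer than the Fermi era; on older toolkits you would hard-code the per-multiprocessor block count.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kThreads = 128;  // threads per block (hypothetical choice)

// Every thread stays alive and keeps pulling items until the queue is empty,
// so the grid size is tied to the hardware, not to the amount of work.
__global__ void persistentWorker(const int* items, int* results,
                                 int numItems, int* nextItem) {
    while (true) {
        int i = atomicAdd(nextItem, 1);    // claim the next unprocessed item
        if (i >= numItems) return;         // queue drained
        results[i] = items[i] * items[i];  // placeholder for the real work
    }
}

std::vector<int> processAll(const std::vector<int>& items) {
    int n = static_cast<int>(items.size());
    std::vector<int> out(n);
    if (n == 0) return out;

    int* dItems = nullptr;
    int* dResults = nullptr;
    int* dNext = nullptr;
    cudaMalloc(&dItems, n * sizeof(int));
    cudaMalloc(&dResults, n * sizeof(int));
    cudaMalloc(&dNext, sizeof(int));
    cudaMemcpy(dItems, items.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dNext, 0, sizeof(int));     // queue head starts at item 0

    // Size the grid to what the device can keep resident at the same time.
    int device = 0;
    cudaGetDevice(&device);
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device);
    int blocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm,
                                                  persistentWorker,
                                                  kThreads, 0);
    int blocks = prop.multiProcessorCount * blocksPerSm;
    if (blocks < 1) blocks = 1;

    persistentWorker<<<blocks, kThreads>>>(dItems, dResults, n, dNext);

    // This blocking copy waits for the kernel to finish first.
    cudaMemcpy(out.data(), dResults, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dItems);
    cudaFree(dResults);
    cudaFree(dNext);
    return out;
}
```

Load balancing falls out for free: a thread that gets a cheap item just comes back for another one sooner. Letting items enqueue more items would need a proper queue with a tail pointer and some care about memory ordering, which this sketch doesn't attempt.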



