It depends. Under high contention, waiting writer threads only touch a single shared cache line containing the lock (repeatedly if spinning, once if there is a proper thread queue). In a lock-free algo, the writers might be stepping on each other's toes all the time, with heavy cache-line ping-pong, which is not good for overall throughput.
Lock free algos are very good when you have many pure readers and few or occasional writers.
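A minimal C++ sketch of that read-mostly pattern (names are illustrative, not from any particular library): readers take a lock-free snapshot through an atomic pointer, while the rare writer builds a fresh copy privately and publishes it with a single store.

```cpp
#include <atomic>

// Read-mostly pattern: readers do one atomic load (no lock, no writes
// to shared cache lines); the occasional writer publishes a new copy.
struct Config { int threshold; };

std::atomic<Config*> g_config{new Config{10}};

int read_threshold() {
    // Single acquire load; the cache line holding g_config stays in
    // Shared state across all reader cores.
    return g_config.load(std::memory_order_acquire)->threshold;
}

void publish_threshold(int v) {
    // Build the new state privately, then publish it in one exchange.
    Config* fresh = new Config{v};
    Config* old = g_config.exchange(fresh, std::memory_order_acq_rel);
    // Real code must reclaim `old` safely (RCU, hazard pointers,
    // epochs); it is leaked here to keep the sketch short.
    (void)old;
}
```

Because readers never write to shared memory, only the infrequent writer invalidates the line, which is exactly why this shape scales so well with many readers.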
In my experience, once you get up to 100s of threads potentially touching global structures, locks impede scalability significantly, even if each of those threads is modifying the global data. The key is to design your algorithms around the kinds of structures that are amenable to lock-free algorithms. (The experience I'm referring to: http://www.scott-a-s.com/files/pldi2017_lf_elastic_schedulin...)
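To illustrate the kind of structure that tolerates many concurrent writers, here's a textbook Treiber stack in C++ (a standard lock-free example, not taken from the linked paper): every writer contends only on the head pointer via compare-and-swap, retrying instead of blocking.

```cpp
#include <atomic>

// Treiber stack: push/pop via CAS on the head pointer. Contending
// threads retry rather than block; no thread ever holds a lock.
struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> g_head{nullptr};

void push(int v) {
    Node* n = new Node{v, g_head.load(std::memory_order_relaxed)};
    // On failure, CAS refreshes n->next with the current head; retry.
    while (!g_head.compare_exchange_weak(n->next, n,
                                         std::memory_order_release,
                                         std::memory_order_relaxed)) {
    }
}

bool pop(int* out) {
    Node* n = g_head.load(std::memory_order_acquire);
    // On failure, CAS refreshes n with the current head; retry.
    while (n && !g_head.compare_exchange_weak(n, n->next,
                                              std::memory_order_acquire,
                                              std::memory_order_acquire)) {
    }
    if (!n) return false;
    *out = n->value;
    return true;  // leaks n; real code needs safe reclamation (ABA, etc.)
}
```

Note the caveats in the comments: a production version needs safe memory reclamation and ABA protection, which is where most of the real design work in lock-free structures goes.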
The Vega 64 GPU that AMD just released has 4096 "threads" operating at one time (actually in flight), with up to 40,960 of them "at the ready" at the hardware level (kinda like Hyperthreading, except GPUs keep up to ~10 threads per "shader core" in memory for quick swap-in-and-outs). Subject to memory requirements, of course: a program that uses a ton of vGPRs (vector registers) on the AMD system may "only" accomplish 4096 threads at a time, and maybe only 5 per core (aka 20,480) of them are needed for maximum occupancy.
It's a weird architecture because each "thread" shares an instruction pointer with its siblings (NVidia has 32 threads per warp, AMD has 64 work-items per wavefront), so it's not "really" the same kind of "thread" as in Linux pthreads. But still, the scope of parallelism on a $500 GPU today is rather outstanding.
All of these threads could potentially hit the same global memory at the same time. I mean, if you want bad performance, of course, but it's entirely possible, since the global memory space is shared between all compute units in a GPU.
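The same "everyone hits one address" effect is easy to demonstrate on a CPU with C++ atomics (illustrative names; a sketch, not a benchmark): every thread incrementing one shared atomic forces that cache line to bounce between cores, while per-thread sharded counters padded to a cache line each avoid the ping-pong entirely.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// One shared atomic: every fetch_add invalidates the same cache line
// in every other core -- the ping-pong described above.
std::atomic<std::uint64_t> g_shared{0};

// Sharded counters, padded so each thread owns its own cache line;
// the total is folded up once at the end.
struct alignas(64) Shard { std::atomic<std::uint64_t> n{0}; };

std::uint64_t count_contended(int threads, int per_thread) {
    g_shared = 0;
    std::vector<std::thread> ts;
    for (int t = 0; t < threads; ++t)
        ts.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i)
                g_shared.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : ts) th.join();
    return g_shared.load();
}

std::uint64_t count_sharded(int threads, int per_thread) {
    std::vector<Shard> shards(threads);
    std::vector<std::thread> ts;
    for (int t = 0; t < threads; ++t)
        ts.emplace_back([&, t] {
            for (int i = 0; i < per_thread; ++i)
                shards[t].n.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : ts) th.join();
    std::uint64_t sum = 0;
    for (auto& s : shards) sum += s.n.load();
    return sum;
}
```

Both versions compute the same total; timing them under many threads is what makes the cache-line traffic visible. On a GPU the analogous fixes are the same idea: privatize into per-workgroup memory, then reduce.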
Yes. See the paper I linked to. I work on a parallel and distributed dataflow system called IBM Streams. When we have thousands of our operators running on a large system, it is realistic for there to be 100s of threads.