It depends. Under high contention, waiting writer threads only touch a single shared cache line containing the lock (repeatedly if spinning, once if there is a proper thread queue). In a lock-free algo, the writers might be stepping on each other's toes all the time, with heavy cache-line ping-pong, which is not good for overall throughput.
Lock free algos are very good when you have many pure readers and few or occasional writers.
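A minimal C++ sketch of that read-mostly pattern (names are illustrative, not from any particular library): readers take a lock-free snapshot through an atomic pointer, while the rare writer builds a fresh copy privately and publishes it with a single store.

```cpp
#include <atomic>

// Read-mostly pattern: readers do one atomic load (no lock, no writes
// to shared cache lines); the occasional writer publishes a new copy.
struct Config { int threshold; };

std::atomic<Config*> g_config{new Config{10}};

int read_threshold() {
    // Single acquire load; the cache line holding g_config stays in
    // Shared state across all reader cores.
    return g_config.load(std::memory_order_acquire)->threshold;
}

void publish_threshold(int v) {
    // Build the new state privately, then publish it in one exchange.
    Config* fresh = new Config{v};
    Config* old = g_config.exchange(fresh, std::memory_order_acq_rel);
    // Real code must reclaim `old` safely (RCU, hazard pointers,
    // epochs); it is leaked here to keep the sketch short.
    (void)old;
}
```

Because readers never write to shared memory, only the infrequent writer invalidates the line, which is exactly why this shape scales so well with many readers.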
In my experience, once you get up to 100s of threads potentially touching global structures, locks impede scalability significantly, even if each of those threads is modifying the global data. The key is to design your algorithms around the kinds of structures that are amenable to lock-free algorithms. (The experience I'm referring to: http://www.scott-a-s.com/files/pldi2017_lf_elastic_schedulin...)
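To illustrate the kind of structure that tolerates many concurrent writers, here's a textbook Treiber stack in C++ (a standard lock-free example, not taken from the linked paper): every writer contends only on the head pointer via compare-and-swap, retrying instead of blocking.

```cpp
#include <atomic>

// Treiber stack: push/pop via CAS on the head pointer. Contending
// threads retry rather than block; no thread ever holds a lock.
struct Node {
    int value;
    Node* next;
};

std::atomic<Node*> g_head{nullptr};

void push(int v) {
    Node* n = new Node{v, g_head.load(std::memory_order_relaxed)};
    // On failure, CAS refreshes n->next with the current head; retry.
    while (!g_head.compare_exchange_weak(n->next, n,
                                         std::memory_order_release,
                                         std::memory_order_relaxed)) {
    }
}

bool pop(int* out) {
    Node* n = g_head.load(std::memory_order_acquire);
    // On failure, CAS refreshes n with the current head; retry.
    while (n && !g_head.compare_exchange_weak(n, n->next,
                                              std::memory_order_acquire,
                                              std::memory_order_acquire)) {
    }
    if (!n) return false;
    *out = n->value;
    return true;  // leaks n; real code needs safe reclamation (ABA, etc.)
}
```

Note the caveats in the comments: a production version needs safe memory reclamation and ABA protection, which is where most of the real design work in lock-free structures goes.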
The Vega 64 GPU that AMD just released has 4096 "threads" operating at one time (actually in flight), with up to 40,960 of them "at the ready" at the hardware level (kinda like Hyperthreading, except GPUs keep up to ~10 threads per "shader core" in memory for quick swap-in-and-outs). Subject to memory requirements, of course: a program that uses a ton of vGPRs (vector registers) on the AMD system may "only" accomplish 4096 threads at a time, and maybe only 5 per core (aka 20,480) of them are needed for maximum occupancy.
It's a weird architecture because each "thread" shares an instruction pointer with its siblings (NVidia has 32 threads per warp, AMD has 64 work-items per wavefront), so it's not "really" the same kind of "thread" as in Linux pthreads. But still, the scope of parallelism on a $500 GPU today is rather outstanding.
All of these threads could potentially hit the same global memory at the same time. I mean, if you want bad performance, of course, but it's entirely possible, since the global memory space is shared between all compute units in a GPU.
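The same "everyone hits one address" effect is easy to demonstrate on a CPU with C++ atomics (illustrative names; a sketch, not a benchmark): every thread incrementing one shared atomic forces that cache line to bounce between cores, while per-thread sharded counters padded to a cache line each avoid the ping-pong entirely.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// One shared atomic: every fetch_add invalidates the same cache line
// in every other core -- the ping-pong described above.
std::atomic<std::uint64_t> g_shared{0};

// Sharded counters, padded so each thread owns its own cache line;
// the total is folded up once at the end.
struct alignas(64) Shard { std::atomic<std::uint64_t> n{0}; };

std::uint64_t count_contended(int threads, int per_thread) {
    g_shared = 0;
    std::vector<std::thread> ts;
    for (int t = 0; t < threads; ++t)
        ts.emplace_back([&] {
            for (int i = 0; i < per_thread; ++i)
                g_shared.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : ts) th.join();
    return g_shared.load();
}

std::uint64_t count_sharded(int threads, int per_thread) {
    std::vector<Shard> shards(threads);
    std::vector<std::thread> ts;
    for (int t = 0; t < threads; ++t)
        ts.emplace_back([&, t] {
            for (int i = 0; i < per_thread; ++i)
                shards[t].n.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : ts) th.join();
    std::uint64_t sum = 0;
    for (auto& s : shards) sum += s.n.load();
    return sum;
}
```

Both versions compute the same total; timing them under many threads is what makes the cache-line traffic visible. On a GPU the analogous fixes are the same idea: privatize into per-workgroup memory, then reduce.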
Yes. See the paper I linked to. I work on a parallel and distributed dataflow system called IBM Streams. When we have thousands of our operators running on a large system, it is realistic for there to be 100s of threads.