> C++ has threads which ACTUALLY run in parallel on the CPU. Why bother complicating the language further?
Actual parallelism is surprisingly slow and resource-heavy in practice. When everything runs on one core, your data stays local in the L1, L2, and L3 caches.
However, add a second core and you suddenly get "ping-ponging". Let's say Core#0 has data in its L1 cache, and now Core#1 needs to read-modify-write it. The cores then need to:
1. Core#1 requests exclusive ownership of that cache line.
2. Core#0 must respond by marking its copy of the line invalid. Core#0 then ejects the data from L1 and passes it to Core#1.
3. Core#1 can finally begin to work on the data.
4. If Core#0 uses the data again, the same handoff happens in reverse.
--------
This is called "ping-ponging". An L1 cache read or write takes about 1 nanosecond, but a ping-pong can cost 30 nanoseconds or more: 30x slower for no reason.
You don't want to add another core to your problem unless you're _actually_ getting a speed benefit. It's more complex than it may seem at first glance. You could easily add a bunch of cores and find your program is suddenly much slower because of these kinds of issues.
Another problem: false sharing. Core#0 is working on "int x" and Core#1 is working on "int y", but x and y sit on the same cache line. So the cores ping-pong even though they never actually touch the same data.
Your example implies some combination of (a) excessive data sharing between threads (b) possible need for core pinning.
If thread A is going to need to read all the data touched by thread B, then it's unclear why you've split the task across threads. If there are still good reasons, then probably pin A and B to core N (never pin anything to core 0, unrelated story), and let them interleave there.
If that doesn't make sense, then yep, you'll have to face the cost of ping-ponging.
In my benchmark of a simple ledger, I generate and execute 25,021,365 (~25 million) random transactions per second (withdrawals and deposits across 80,000 accounts).
Parallelise it via sharding and I get 80,958,379 transactions per second.
If I remove the randomness, I get 134,600,233 transactions per second. If I parallelise with 12 threads I get 931,024,042 transactions per second.
My point being: if you parallelise properly and do not use shared data, you can boost performance by a multiple of the number of threads.
This is unlikely in most workloads unless multiple threads are hitting very close or adjacent regions of memory. Most multithreaded workloads operate on memory that isn't packed that tightly together. It's good to be aware of, but suggesting it's the most likely outcome of multithreading is misleading.