PCIe buses are laid out like a tree, with "hubs" that are really switches.
Imagine a PC whose PCIe x16 interface is attached to a PCIe switch with four x16 downstream ports, each connected to a GPU. Each GPU can move data in and out of its PCIe interface at full x16 speed.
If you want to transfer data from GPU0 and 1 to GPU2 and 3, you have basically two options:
- Have GPU0 and 1 move their data to CPU DRAM, then have GPU2 and 3 fetch it
- Have GPU0 and 1 write their data directly to GPU2 and 3 through the switch they’re connected to without ever going up to the CPU at all
In this case, option 2 is better both because it avoids the extra copy to CPU DRAM and because it avoids the bottleneck of two GPUs trying to push x16 worth of data up through the CPU's single x16 port. This is known as peer-to-peer (P2P).
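As a concrete illustration, here is a minimal sketch of option 2 using the CUDA runtime API: it asks the driver whether a direct peer path exists, enables peer access in both directions, and copies a buffer between GPUs with `cudaMemcpyPeer`. The device indices and buffer size are placeholders standing in for the four-GPU topology described above.

```c
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define CHECK(call)                                              \
    do {                                                         \
        cudaError_t err = (call);                                \
        if (err != cudaSuccess) {                                \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                    cudaGetErrorString(err));                    \
            exit(EXIT_FAILURE);                                  \
        }                                                        \
    } while (0)

int main(void) {
    const int src = 0, dst = 2;       /* placeholder pair, e.g. GPU0 -> GPU2 */
    const size_t nbytes = 64 << 20;   /* 64 MiB test buffer */

    /* 1. Ask the driver whether a direct peer path exists. */
    int can_access = 0;
    CHECK(cudaDeviceCanAccessPeer(&can_access, src, dst));
    if (!can_access) {
        fprintf(stderr, "P2P not available between GPU%d and GPU%d\n", src, dst);
        return EXIT_FAILURE;
    }

    /* 2. Enable peer access in both directions. */
    CHECK(cudaSetDevice(src));
    CHECK(cudaDeviceEnablePeerAccess(dst, 0));
    CHECK(cudaSetDevice(dst));
    CHECK(cudaDeviceEnablePeerAccess(src, 0));

    /* 3. Allocate a buffer on each GPU and copy directly between them.
       With peer access enabled, this copy is not staged through CPU DRAM. */
    void *src_buf, *dst_buf;
    CHECK(cudaSetDevice(src));
    CHECK(cudaMalloc(&src_buf, nbytes));
    CHECK(cudaSetDevice(dst));
    CHECK(cudaMalloc(&dst_buf, nbytes));

    CHECK(cudaMemcpyPeer(dst_buf, dst, src_buf, src, nbytes));
    CHECK(cudaDeviceSynchronize());
    printf("copied %zu bytes GPU%d -> GPU%d peer-to-peer\n", nbytes, src, dst);

    CHECK(cudaFree(dst_buf));
    CHECK(cudaSetDevice(src));
    CHECK(cudaFree(src_buf));
    return 0;
}
```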
There are some other scenarios where the data still must go up to the CPU's root port and back down because of ACS (Access Control Services, which can redirect peer traffic up to the root complex for validation). That path is still technically P2P, but it doesn't avoid the bottleneck the way routing through the switch does.
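You can't see directly from CUDA whether traffic takes the switch path or the ACS-redirected root-complex path, but the runtime does expose a hint: `cudaDeviceGetP2PAttribute` reports whether P2P is supported for a pair of devices and a relative performance rank for the link (lower is better). A quick sketch that surveys every pair; to actually confirm the routing you'd still want a bandwidth measurement:

```c
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int ndev = 0;
    cudaGetDeviceCount(&ndev);

    /* For every device pair, report P2P support and the driver's
       relative performance rank for that link (lower = better path). */
    for (int a = 0; a < ndev; ++a) {
        for (int b = 0; b < ndev; ++b) {
            if (a == b) continue;
            int supported = 0, rank = 0;
            cudaDeviceGetP2PAttribute(&supported,
                                      cudaDevP2PAttrAccessSupported, a, b);
            cudaDeviceGetP2PAttribute(&rank,
                                      cudaDevP2PAttrPerformanceRank, a, b);
            printf("GPU%d -> GPU%d: P2P %s, perf rank %d\n",
                   a, b, supported ? "supported" : "unsupported", rank);
        }
    }
    return 0;
}
```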