Distributed computing gives you plenty of FLOPS in aggregate, but very low connection speeds between nodes.
LUMI has 200 Gbit/s connections between nodes, or roughly 25 GB/s: faster than PCIe 3.0 x16 (~15.8 GB/s).
In effect: supercomputers can treat "remote memory" as if it were local (via RDMA). So you can treat the entire RAM space as unified: your 64-bit pointers can address memory across the whole supercomputer, and your data structures can be distributed yet always reachable through that 200 Gbit/s pipeline.
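In practice that remote-memory model is usually reached through something like MPI's one-sided (RMA) interface, which sits on top of the RDMA-capable fabric. A minimal sketch, with the buffer size and rank choices purely illustrative:

```c
/* Sketch: one-sided remote memory access with MPI RMA.
 * Build with an MPI compiler wrapper, e.g. mpicc rma_sketch.c -o rma_sketch
 * Sizes and ranks are illustrative, not tuned for any particular machine. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int N = 1 << 20;                        /* 1M doubles per rank (~8 MB) */
    double *local = malloc(N * sizeof(double));
    for (int i = 0; i < N; i++) local[i] = rank;  /* fill with our rank id */

    /* Expose each rank's buffer as a window: the "shared RAM-space". */
    MPI_Win win;
    MPI_Win_create(local, N * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Rank 0 reads straight out of the last rank's memory.
     * On an RDMA fabric this is a remote read; the target CPU sits idle. */
    double probe = -1.0;
    MPI_Win_lock(MPI_LOCK_SHARED, nranks - 1, 0, win);
    MPI_Get(&probe, 1, MPI_DOUBLE, nranks - 1, /*disp=*/0, 1, MPI_DOUBLE, win);
    MPI_Win_unlock(nranks - 1, win);

    if (rank == 0)
        printf("value fetched from rank %d's memory: %.1f\n", nranks - 1, probe);

    MPI_Win_free(&win);
    free(local);
    MPI_Finalize();
    return 0;
}
```

The notable bit is that MPI_Get pulls data directly out of another node's exposed window, which is about as close to "remote RAM behaving like local RAM" as the programming model gets.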
--------
As it turns out: you need a very fast interconnect to truly sustain supercomputing workloads. A lot of these things turn out to be just crazy big matrix-multiplication problems that require a fair amount of coordination between all the nodes.
You can't share a problem like that on commodity distributed compute. At best, you can only farm out problems that fit on a single machine (under ~32 GB of RAM). In contrast, these supercomputers can work on 100+ TB shared-RAM problems with 100,000+ TB of shared storage (such as simulating quantum effects). That shared storage is accessed at ~2 TB/s and accelerated with flash SSD cache layers.
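To make the "coordination" concrete, here is a toy block-row distributed matrix multiply in MPI. A real code would use ScaLAPACK/SUMMA-style blocking and GPUs, but the pattern is the same: every node's compute phase depends on data that first has to cross the interconnect. The sizes and the Allgather layout below are just illustrative.

```c
/* Sketch: 1D block-row distributed matrix multiply with MPI.
 * Each rank owns N/nranks rows of A and of B; the full B is assembled
 * with MPI_Allgather before the local multiply. Dimensions are toy-sized
 * and assume N is divisible by the number of ranks. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int N = 512;                  /* global matrices are N x N   */
    const int rows = N / nranks;        /* rows owned by this rank     */

    double *A     = calloc((size_t)rows * N, sizeof(double)); /* my rows of A */
    double *Bpart = calloc((size_t)rows * N, sizeof(double)); /* my rows of B */
    double *B     = calloc((size_t)N * N,    sizeof(double)); /* full B, gathered */
    double *C     = calloc((size_t)rows * N, sizeof(double)); /* my rows of C */

    for (int i = 0; i < rows * N; i++) { A[i] = 1.0; Bpart[i] = 1.0; }

    /* The "coordination" step: every rank contributes its rows of B and
     * receives everyone else's -- O(N^2) data crossing the interconnect. */
    MPI_Allgather(Bpart, rows * N, MPI_DOUBLE,
                  B,     rows * N, MPI_DOUBLE, MPI_COMM_WORLD);

    /* Local compute: my rows of C = my rows of A times the full B. */
    for (int i = 0; i < rows; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                C[i * N + j] += A[i * N + k] * B[k * N + j];

    if (rank == 0)
        printf("C[0][0] = %.1f (expect %d)\n", C[0], N);

    free(A); free(Bpart); free(B); free(C);
    MPI_Finalize();
    return 0;
}
```

Even in this simplified form, the gather step moves on the order of N² doubles across the fabric before any useful arithmetic happens, which is why the interconnect, not the raw FLOPS, quickly becomes the limiting factor.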
---------
As some people say: the job of a supercomputer is to turn everything into an I/O-constrained problem. As such, a HUGE amount of money is poured into making I/O as fast as reasonably possible. You don't want your PFLOP-scale machine to be throttled by slow storage or communications.
Broken record time, but it's the latency that matters most, not the bandwidth, which you could get with generic Ethernet anyway. There are also trade-offs in the fabric topology; I don't remember if Cray are still using Dragonfly.
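The latency point is easy to see with the classic two-rank ping-pong microbenchmark; just a sketch, with arbitrary message sizes. At small sizes the time is almost entirely per-message latency and the effective bandwidth collapses, which is exactly where a generic Ethernet stack hurts most.

```c
/* Sketch: two-rank ping-pong to separate latency from bandwidth.
 * Small messages are dominated by per-message latency; large ones by
 * bandwidth. Run with exactly 2 ranks; message sizes are illustrative. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    char *buf = malloc(1 << 20);                 /* up to 1 MB messages */

    for (int bytes = 8; bytes <= (1 << 20); bytes *= 16) {
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
            }
        }

        /* One-way time per message in microseconds. */
        double one_way_us = (MPI_Wtime() - t0) / (2.0 * iters) * 1e6;
        if (rank == 0)
            printf("%8d bytes: %8.2f us one-way, %10.2f MB/s\n",
                   bytes, one_way_us, bytes / one_way_us);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```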