I worked on PS3 titles that used SPUs and I can't tell you how relieved I was to find that the PS4 did away with them. The fear that the next PS would have "64 SPUs and a marginally faster PPU" was real.
The analogy I gave to my friends at the time was working in a restaurant kitchen with a tiny stovetop and two dozen microwave ovens: it does some things really fast, but only if you can cut the work up into small, microwave-friendly pieces.
Both the PS3 and X360 CPUs had no out-of-order execution, no speculative execution, no automatic prefetch, and a roughly 500-cycle latency for hitting main memory outside of cache.
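With no hardware prefetcher, hiding that latency was on you; the usual trick was issuing software prefetches a little ahead of where you were reading. A minimal sketch of the idea (hypothetical loop and data, using GCC's __builtin_prefetch, which lowers to the PowerPC dcbt cache-touch; the prefetch distance would need tuning):

    #include <stddef.h>

    /* Hypothetical example: walk a big array, touching the cache line we'll
       need a few iterations from now so the ~500-cycle miss overlaps with
       the work on the current elements. 32 floats = 128 bytes, about one
       cache line on those cores. */
    float sum(const float *data, size_t n)
    {
        float acc = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            if (i + 32 < n)
                __builtin_prefetch(&data[i + 32]);  /* touch ahead of use */
            acc += data[i];
        }
        return acc;
    }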
The hardware design was based on: 1) this is currently the only way to hit that clock rate at this cost; 2) hyperthreading effectively halves all latencies; 3) it's a fixed platform, so the compiler should know exactly what to do. Point 3 was pretty laughable. It's only technically possible if you have huge linear code blocks without branching or dynamic addressing.
I would remind junior programmers that the PS2 ran at 300 MHz with a 50-cycle memory latency, while the PS3 ran at 3 GHz with a 500-cycle latency. So if you are missing cache, your PS3 game runs like it's on a PS2.
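Back-of-envelope, with those rough numbers:

    PS2: 50 cycles  / 300 MHz ≈ 167 ns per miss
    PS3: 500 cycles / 3 GHz   ≈ 167 ns per miss

Same wall-clock stall per cache miss, so code that misses constantly gains almost nothing from the 10x clock.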
On the other hand, a lot of people overreacted to the manual DMA situation of SPU programming. DMAs from main mem into SPU local store had a latency of... 500 cycles! Once people put 2 and 2 together, SPU programming became less scary. Still a pain in the ass, and a lot of work to reach peak performance, but more approachable for sub-optimal tasks.
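For anyone who never touched it, a blocking get looked roughly like this (a from-memory sketch using the SDK's spu_mfcio.h intrinsics; the buffer name and size are made up, and transfers had to be 16-byte aligned, ideally 128):

    #include <stdint.h>
    #include <spu_mfcio.h>

    static char buf[16384] __attribute__((aligned(128)));  /* hypothetical local-store buffer */

    void fetch_chunk(uint64_t ea)          /* ea = effective address in main memory */
    {
        const unsigned tag = 1;
        mfc_get(buf, ea, sizeof(buf), tag, 0, 0);  /* kick off the DMA into local store */
        /* ...ideally do other work here instead of eating the ~500 cycles... */
        mfc_write_tag_mask(1 << tag);              /* select which tag group to wait on */
        mfc_read_tag_status_all();                 /* block until that DMA completes */
    }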
Also, that damn load-hit-store penalty. The PPE and Xenon cores could only detect the conflict and stall when a load hit an address that was still being flushed out of the write buffers. Most sane uarchs will forward the data straight out of the store queue, but those cores had to flush it all the way out to L1 and read it back in.
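If memory serves, the canonical way to trip it was anything that bounced a value between the float and integer register files, since the only path between them was through memory. A hedged example:

    /* The float-to-int conversion compiles to roughly fctiwz -> store ->
       immediate reload into a GPR, and that reload stalls until the store
       has drained all the way out to L1. Dozens of cycles for one cast. */
    int quantize(float x)
    {
        return (int)(x * 255.0f);
    }

The usual advice was to batch such conversions or keep data in one register domain for as long as possible.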
Also, IIRC, shift by variable amount was microcoded and would take a cycle for each bit distance shifted.
> Also, IIRC, shift by variable amount was microcoded and would take a cycle for each bit distance shifted.
Wait, it was possible to ship a CPU in 2006 without a barrel shifter? ARM1 was from 1985!
If it was something silly like "only constant bit shifts use the barrel shifter", I'm surprised that compilers didn't compile variable shifts as a jump table to a bunch of constant shift instructions... :)
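Tongue in cheek, but the shape of it would be something like this (a toy sketch; most cases are elided, and in practice the branchy dispatch would almost certainly cost more than the microcoded shift):

    #include <stdint.h>

    uint32_t shl_var(uint32_t x, unsigned n)
    {
        switch (n & 31) {              /* each case is a constant shift -> single cycle */
        case 0:  return x;
        case 1:  return x << 1;
        case 2:  return x << 2;
        case 3:  return x << 3;
        /* ...cases 4..31 elided... */
        default: return x << (n & 31); /* fallback so the sketch still compiles */
        }
    }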
Any logical shift, rotate, bit-field extract (by a constant) and more, all in one single-cycle instruction that's been included since the earliest POWER days. They had a barrel rotator in the core for that instruction.
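(That's rlwinm, rotate-left-word-immediate then AND with mask, if I'm remembering the mnemonic right. One instruction covers things like

    (x >> 8) & 0xff   ->   rlwinm r3, r4, 24, 24, 31   # rotate left 24, keep bits 24..31

and the familiar extended mnemonics like srwi and extrwi are just rlwinm spellings underneath.)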
The only thing that makes any sense to me is that somehow it would have been too expensive to rig up another register file read port to that sh (shift amount) input, so they just pump it as many times as needed with sh fixed to 1. They seemed to be on some gate-count crusade that might have paid off if they had been able to clock it faster at the end of the day. It took the industry a bit to figure out that ubiquitous 10 GHz chips weren't going to happen, and the hardest lessons would have landed right in that design cycle. :\
> If it was something silly like "only constant bit shifts use the barrel shifter", I'm surprised that compilers didn't compile variable shifts as a jump table to a bunch of constant shift instructions... :)
Variable shift isn't the most common op in the world, so as far as I know it was just listed as something to avoid if you're writing tight loops.
It really isn’t - the individual elements in a GPU can access RAM directly, while you needed to wheelbarrow data (and code!) to and from SPUs manually. This was the major pain point, not the extreme (for that time) parallelism or the weird-ish instruction set.
It depends on what you mean by "access RAM directly": both SPUs and GPUs (at least on the PS4) can read/write system memory, and both do it asynchronously. If you want random access to system memory you will die on a GPU even sooner than on an SPU, since the latencies are much bigger.
So there is not much difference imho and the GP is correct.
SPUs read memory asynchronously by requesting a DMA; GPUs read memory asynchronously too, but this is not explicitly visible to the programmer, and serious infrastructure is devoted to hiding that latency. The problem with SPUs was never that they were slow, but that they were outright programmer-hostile.
> but this is not explicitly visible to the programmer, and serious infrastructure is devoted to hiding that latency
What do you mean? You can hide latency on an SPU by double-buffering the DMA, but in a shader there is no such infrastructure at all and, unlike on the SPU, no way to hide it yourself: you just block until the memory fetch completes once you need the data.
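Concretely, the double-buffering was just a ping-pong over two local-store buffers: kick off the DMA for chunk i+1, crunch chunk i, swap. A rough sketch (names, sizes and the do_work kernel are made up):

    #include <stdint.h>
    #include <spu_mfcio.h>

    #define CHUNK 16384
    static char buf[2][CHUNK] __attribute__((aligned(128)));

    extern void do_work(char *p, unsigned n);           /* hypothetical compute kernel */

    void process_stream(uint64_t ea, unsigned nchunks)
    {
        unsigned cur = 0;
        mfc_get(buf[0], ea, CHUNK, 0, 0, 0);            /* prime the first buffer */
        for (unsigned i = 0; i < nchunks; ++i) {
            unsigned next = cur ^ 1;
            if (i + 1 < nchunks)                        /* start fetching chunk i+1 now */
                mfc_get(buf[next], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, next, 0, 0);
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();                  /* wait only for chunk i */
            do_work(buf[cur], CHUNK);                   /* compute overlaps the in-flight DMA */
            cur = next;
        }
    }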
> they are outright programmer-hostile
Depends on the programmer, I guess. I enjoyed programming SPUs, and I don't personally know anybody who had complaints. I've only read about the "insanely hard to program PS3" on the internet and wondered, "who are those people?" It's especially funny because the RSX was a pitiful piece of crap with crappy tooling [+] from NVidia, yet nobody complaining about SPUs ever mentions that.
[+] Not an exaggeration. For example, the Cg compiler would produce different code if you added +0 or *1 to random scalars in your shader, and not necessarily slower code either! So one of the release steps was brute-forcing these tweaks to shave a few clocks off the shaders.
> The SPU approach is pretty much what a GPU does, but instead of 64 they have hundreds if not thousands.
Eh, only if you count every SIMD lane as a separate "core" like GPU manufacturer marketing does. More realistically, you should count what NVIDIA calls SMs, where the numbers are more comparable (GeForce RTX 3080 has 80, for example).
> I wonder if that university still uses their PS3 based supercomputer cluster
Almost certainly not. The typical lifetime of a supercomputer is around 5 years, give or take. After that, the electricity it consumes makes it not worth continuing to run versus buying a new one.
AFAIK both Titan and Jaguar received significant mid-life upgrades. So in such a scenario 7 years sounds reasonable.
(At a previous job, we had a cluster that was about 15 years old. Of course, it had been expanded and upgraded over the years, so I'm not sure anything was left of the original. Maybe some racks and power cables.. :) )
The hype was probably mostly from "Crazy" Ken Kutaragi, who had a knack for dialing the Playstation hype beyond 11. The PS2 was already supposed to replace your home PC, revolutionize ecommerce and online gaming, and plug you into the Matrix [1]. In the past some compared him to Steve Jobs, but in hindsight he seems more like Sony's Baghdad Bob.
I've heard on the grapevine that the PS3's OtherOS facility was internally thought of as another go at the same idea. "Look, judge, it's a general purpose computer for reals this time. Your own universities are using it in super computing clusters, without ever launching a game".
Being forced, kicking and screaming, into multicore on the PS3 improved our PC codebase by quite a good margin. So what I'm trying to say is that the hype was real and we have gone multicore, just not with the Cell architecture.
Did the 360 not drag things into multicore very similarly? Pretending its CPU has one core is a pretty similar experience to pretending a PS3 has no SPEs.
AFAIK these university projects were done to test out how Cell-based supercomputers would behave. They're not competitive with up-to-date supercomputers.
In my lab I found a couple of PS3s lying around several years ago, that hadn't been used in quite a while. (One of them may or may not have been adopted for less scientific purposes …)
In 2012 the Air Force Research Laboratory built a supercomputer cluster from 1,760 PS3s that I think was in practical use for a while [0]. I recall reading online that the Air Force struck a special deal with Sony to buy some of the last remaining PS3s manufactured with Linux installation still supported, but that isn't mentioned in this source.
>The Condor Cluster project began four years ago, when PlayStation consoles cost about $400 each. At the same time, comparable technology would have cost about $10,000 per unit. Overall, the PS3s for the supercomputer's core cost about $2 million. According to AFRL Director of High Power Computing Mark Barnell, that cost is about 5-10% of the cost of an equivalent system built with off-the-shelf computer parts.
>Another advantage of the PS3-based supercomputer is its energy efficiency: it consumes just 10% of the power of comparable supercomputers.
I wonder how significant the cost and energy savings were by the time the project was finished, and how long the cluster was actually used.
I can think of a ton of number crunching you can do where each node having 256 wouldn’t be awful. Like maybe brute forcing stuff where they can just segment the use case to each node. But it would certainly be a specialized case.
I remember speculation at the time that the PS3, while running, would be able to borrow from other Cell-powered appliances in your house, e.g. your toaster, to enhance its power.
Actually, I just did a quick Google on this and it was "Sony" themselves that appeared to mention this! [1]
Hmm, probably not. Seems like the biggest PS3 cluster had 1,760 units and was rated at 500 teraflops (single precision float, presumably).
An NVIDIA A100 GPU does about 20 teraflops. So you only need 25 of those chips to match the theoretical rating, and they have many other advantages like much higher memory per core, etc.
Have you got the NVIDIA numbers right? An NVIDIA DGX Station A100 (a desktop/workstation-sized computer with 4x A100 that draws 1.5 kW of power) is rated at 2.5 petaFLOPS, so a good 30+% more than the PS3 cluster.
Also, the PS3 apparently drew up to 200 W, so a cluster that size would have drawn 352 kW.
The units being used here are likely different. In most press releases Nvidia uses their "tensor core" performance, usually with either sparsity or 16 bit data. A single A100 is said to have 320 teraflops of "tensor float" performance but only 19 teraflops of "normal" full FP32 performance.
This is way out of my field so I don't know the full implications, but my understanding is that NVIDIA cards can only reach these speeds at the loss of precision or full functionality, so it's an apples-to-oranges comparison versus non-NVIDIA chips.
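Roughly, from NVIDIA's published A100 numbers:

    FP32 (plain CUDA cores):       ~19.5 TFLOPS per GPU  ->  4x A100 ≈ 78 TFLOPS
    FP16 tensor cores (dense):     ~312 TFLOPS per GPU   ->  4x A100 ≈ 1.25 PFLOPS
    with 2:4 structured sparsity:  x2                    ->  ≈ 2.5 PFLOPS

So the DGX Station's "2.5 petaFLOPS" is the sparse tensor-core figure; on plain FP32 the comparison with the PS3 cluster's 500 TFLOPS is much closer to the 25-chip estimate upthread.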
I mean, the Cell SPU design is everywhere; it's just now part of every GPU and called a compute shader. Vector processors are a good idea, but they have to be accessible to programmers through standard, easy-to-use, ideally cross-platform APIs or they invariably won't get used to their full potential.
I wonder if that university still uses their PS3 based supercomputer cluster