I worked on PS3 titles that used SPUs and I can't tell you how relieved I was to find that the PS4 did away with them. The fear that the next PS would have "64 SPUs and a marginally faster PPU" was real.
The analogy I gave to my friends at the time was working in a restaurant kitchen with a tiny stovetop and two dozen microwave ovens: it does some things really fast, but only if you can cut the work up into small, microwave-friendly pieces.
Both the PS3 and X360 CPUs had no out-of-order execution, no speculative execution, no automatic prefetch, and a roughly 500-cycle latency for hitting main memory outside of cache.
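With no hardware prefetcher, hiding that latency was on you; the usual trick was issuing software prefetches a little ahead of where you were reading. A minimal sketch of the idea (hypothetical loop and data, using GCC's __builtin_prefetch, which lowers to the PowerPC dcbt cache-touch; the prefetch distance would need tuning):

    #include <stddef.h>

    /* Hypothetical example: walk a big array, touching the cache line we'll
       need a few iterations from now so the ~500-cycle miss overlaps with
       the work on the current elements. 32 floats = 128 bytes, about one
       cache line on those cores. */
    float sum(const float *data, size_t n)
    {
        float acc = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            if (i + 32 < n)
                __builtin_prefetch(&data[i + 32]);  /* touch ahead of use */
            acc += data[i];
        }
        return acc;
    }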
The hardware design was based on: 1) this is currently the only way to hit that clock rate at this cost; 2) hyperthreading effectively halves all latencies; 3) it's a fixed platform, so the compiler should know exactly what to do. Point 3 was pretty laughable. It's only technically possible if you have huge linear code blocks without branching or dynamic addressing.
I would remind junior programmers that the PS2 ran at 300 MHz with a 50-cycle memory latency, while the PS3 ran at 3 GHz with a 500-cycle latency. So if you are missing cache, your PS3 game runs like it's on a PS2.
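Back-of-envelope, with those rough numbers:

    PS2: 50 cycles  / 300 MHz ≈ 167 ns per miss
    PS3: 500 cycles / 3 GHz   ≈ 167 ns per miss

Same wall-clock stall per cache miss, so code that misses constantly gains almost nothing from the 10x clock.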
On the other hand, a lot of people overreacted to the manual DMA situation of SPU programming. DMAs from main mem into SPU local store had a latency of... 500 cycles! Once people put 2 and 2 together, SPU programming became less scary. Still a pain in the ass, and a lot of work to reach peak performance, but more approachable for sub-optimal tasks.
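For anyone who never touched it, a blocking get looked roughly like this (a from-memory sketch using the SDK's spu_mfcio.h intrinsics; the buffer name and size are made up, and transfers had to be 16-byte aligned, ideally 128):

    #include <stdint.h>
    #include <spu_mfcio.h>

    static char buf[16384] __attribute__((aligned(128)));  /* hypothetical local-store buffer */

    void fetch_chunk(uint64_t ea)          /* ea = effective address in main memory */
    {
        const unsigned tag = 1;
        mfc_get(buf, ea, sizeof(buf), tag, 0, 0);  /* kick off the DMA into local store */
        /* ...ideally do other work here instead of eating the ~500 cycles... */
        mfc_write_tag_mask(1 << tag);              /* select which tag group to wait on */
        mfc_read_tag_status_all();                 /* block until that DMA completes */
    }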
Also, that damn load-hit-store penalty. The PPE and Xenon cores could only detect the conflict and stall when a load hit an address that was still being flushed out of the write buffers. Most sane uarchs will forward the data straight out of the store queue, but those cores had to flush it all the way out to L1 and read it back in.
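If memory serves, the canonical way to trip it was anything that bounced a value between the float and integer register files, since the only path between them was through memory. A hedged example:

    /* The float-to-int conversion compiles to roughly fctiwz -> store ->
       immediate reload into a GPR, and that reload stalls until the store
       has drained all the way out to L1. Dozens of cycles for one cast. */
    int quantize(float x)
    {
        return (int)(x * 255.0f);
    }

The usual advice was to batch such conversions or keep data in one register domain for as long as possible.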
Also, IIRC, shift by variable amount was microcoded and would take a cycle for each bit distance shifted.
> Also, IIRC, shift by variable amount was microcoded and would take a cycle for each bit distance shifted.
Wait, it was possible to ship a CPU in 2006 without a barrel shifter? ARM1 was from 1985!
If it was something silly like "only constant bit shifts use the barrel shifter", I'm surprised that compilers didn't compile variable shifts as a jump table to a bunch of constant shift instructions... :)
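Tongue in cheek, but the shape of it would be something like this (a toy sketch; most cases are elided, and in practice the branchy dispatch would almost certainly cost more than the microcoded shift):

    #include <stdint.h>

    uint32_t shl_var(uint32_t x, unsigned n)
    {
        switch (n & 31) {              /* each case is a constant shift -> single cycle */
        case 0:  return x;
        case 1:  return x << 1;
        case 2:  return x << 2;
        case 3:  return x << 3;
        /* ...cases 4..31 elided... */
        default: return x << (n & 31); /* fallback so the sketch still compiles */
        }
    }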
Any logical shift, rotate, bit-field extract (by a constant) and more, all in one single-cycle instruction that's been included since the earliest POWER days. They had a barrel rotator in the core for that instruction.
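(That's rlwinm, rotate-left-word-immediate then AND with mask, if I'm remembering the mnemonic right. One instruction covers things like

    (x >> 8) & 0xff   ->   rlwinm r3, r4, 24, 24, 31   # rotate left 24, keep bits 24..31

and the familiar extended mnemonics like srwi and extrwi are just rlwinm spellings underneath.)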
The only thing that makes any sense to me is that somehow it would have been too expensive to rig up another register file read port to that sh (shift amount) input, so they just pump it as many times as needed with sh fixed to 1. They seemed to be on some gate-count crusade that might have paid off if they had been able to clock it faster at the end of the day. It took the industry a bit to figure out that ubiquitous 10 GHz chips weren't going to happen, and the hardest lessons would have landed right in that design cycle. :\
> If it was something silly like "only constant bit shifts use the barrel shifter", I'm surprised that compilers didn't compile variable shifts as a jump table to a bunch of constant shift instructions... :)
Variable shift isn't the most common op in the world, so as far as I know it was just listed as something to avoid if you're writing tight loops.
It really isn’t - the individual elements in a GPU can access RAM directly, while you needed to wheelbarrow data (and code!) to and from SPUs manually. This was the major pain point, not the extreme (for that time) parallelism or the weird-ish instruction set.
It depends on what you mean by "access RAM directly": both SPUs and GPUs (at least on the PS4) can read/write system memory, and both do it asynchronously. If you want random access to system memory you will die on a GPU even sooner than on an SPU, since the latencies are much bigger.
So there is not much difference imho and the GP is correct.
SPUs read memory asynchronously by requesting a DMA; GPUs read memory asynchronously too, but this is not explicitly visible to the programmer, and serious infrastructure is devoted to hiding that latency. The problem with SPUs was never that they were slow, but that they were outright programmer-hostile.
> but this is not explicitly visible to the programmer, and serious infrastructure is devoted to hiding that latency
What do you mean? You can hide latency on an SPU by double-buffering the DMA, but in a shader there is no such infrastructure at all and, unlike on the SPU, no way to hide it yourself: you just block until the memory fetch completes once you need the data.
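Concretely, the double-buffering was just a ping-pong over two local-store buffers: kick off the DMA for chunk i+1, crunch chunk i, swap. A rough sketch (names, sizes and the do_work kernel are made up):

    #include <stdint.h>
    #include <spu_mfcio.h>

    #define CHUNK 16384
    static char buf[2][CHUNK] __attribute__((aligned(128)));

    extern void do_work(char *p, unsigned n);           /* hypothetical compute kernel */

    void process_stream(uint64_t ea, unsigned nchunks)
    {
        unsigned cur = 0;
        mfc_get(buf[0], ea, CHUNK, 0, 0, 0);            /* prime the first buffer */
        for (unsigned i = 0; i < nchunks; ++i) {
            unsigned next = cur ^ 1;
            if (i + 1 < nchunks)                        /* start fetching chunk i+1 now */
                mfc_get(buf[next], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, next, 0, 0);
            mfc_write_tag_mask(1 << cur);
            mfc_read_tag_status_all();                  /* wait only for chunk i */
            do_work(buf[cur], CHUNK);                   /* compute overlaps the in-flight DMA */
            cur = next;
        }
    }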
> they are outright programmer-hostile
Depends on the programmer, I guess. I enjoyed programming SPUs, and I don't personally know anybody who had complaints. I've only read about the "insanely hard to program PS3" on the internet and wondered, "who are those people?" It's especially funny because the RSX was a pitiful piece of crap with crappy tooling [+] from NVidia, yet nobody complaining about SPUs ever mentions that.
[+] Not an exaggeration. For example, the Cg compiler would produce different code if you added +0 or *1 to random scalars in your shader, and not necessarily slower code either! So one of the release steps was brute-forcing these tweaks to shave a few clocks off the shaders.
> The SPU approach is pretty much what a GPU does, but instead of 64 they have hundreds if not thousands.
Eh, only if you count every SIMD lane as a separate "core" like GPU manufacturer marketing does. More realistically, you should count what NVIDIA calls SMs, where the numbers are more comparable (GeForce RTX 3080 has 80, for example).
> I wonder if that university still uses their PS3 based supercomputer cluster
Almost certainly not. The typical lifetime of a supercomputer is around 5 years, give or take. After that, the electricity it consumes makes it not worth continuing to run versus buying a new one.
AFAIK both Titan and Jaguar received significant mid-life upgrades. So in such a scenario 7 years sounds reasonable.
(At a previous job, we had a cluster that was about 15 years old. Of course, it had been expanded and upgraded over the years, so I'm not sure anything was left of the original. Maybe some racks and power cables.. :) )
The hype was probably mostly from "Crazy" Ken Kutaragi, who had a knack for dialing the Playstation hype beyond 11. The PS2 was already supposed to replace your home PC, revolutionize ecommerce and online gaming, and plug you into the Matrix [1]. In the past some compared him to Steve Jobs, but in hindsight he seems more like Sony's Baghdad Bob.
I've heard on the grapevine that the PS3's OtherOS facility was internally thought of as another go at the same idea. "Look, judge, it's a general purpose computer for reals this time. Your own universities are using it in super computing clusters, without ever launching a game".
Being forced, kicking and screaming, into multicore on the PS3 improved our PC codebase by quite a good margin. So what I'm trying to say is that the hype was real and we have gone multicore, just not with the Cell architecture.
Did the 360 not drag things into multicore very similarly? Pretending its CPU has one core is a pretty similar experience to pretending a PS3 has no SPEs.
AFAIK these university projects were done to test out how Cell-based supercomputers would behave. They're not competitive with up-to-date supercomputers.
In my lab I found a couple of PS3s lying around several years ago, that hadn't been used in quite a while. (One of them may or may not have been adopted for less scientific purposes …)
In 2012 the Air Force Research Laboratory built a supercomputer cluster from 1,760 PS3s that I think was in practical use for a while [0]. I recall reading online that the Air Force struck a special deal with Sony to buy some of the last remaining PS3s manufactured with Linux installation still supported, but that isn't mentioned in this source.
>The Condor Cluster project began four years ago, when PlayStation consoles cost about $400 each. At the same time, comparable technology would have cost about $10,000 per unit. Overall, the PS3s for the supercomputer's core cost about $2 million. According to AFRL Director of High Power Computing Mark Barnell, that cost is about 5-10% of the cost of an equivalent system built with off-the-shelf computer parts.
>Another advantage of the PS3-based supercomputer is its energy efficiency: it consumes just 10% of the power of comparable supercomputers.
I wonder how significant the cost and energy savings were by the time the project was finished, and how long the cluster was actually used.
I can think of a ton of number crunching you can do where each node having 256 wouldn’t be awful. Like maybe brute forcing stuff where they can just segment the use case to each node. But it would certainly be a specialized case.
I remember speculation at the time that the PS3, while running, would be able to borrow from other Cell-powered appliances in your house, e.g. your toaster, to enhance its power.
Actually, I just did a quick Google on this and it was "Sony" themselves that appeared to mention this! [1]
Hmm, probably not. Seems like the biggest PS3 cluster had 1,760 units and was rated at 500 teraflops (single precision float, presumably).
An NVIDIA A100 GPU does about 20 teraflops. So you only need 25 of those chips to match the theoretical rating, and they have many other advantages like much higher memory per core, etc.
Have you got the NVIDIA numbers right? An NVIDIA DGX Station A100 (a desktop/workstation-sized computer with 4x A100 that draws 1.5 kW of power) is rated at 2.5 petaFLOPS, so a good 30+% more than the PS3 cluster.
Also, the PS3 apparently drew up to 200 W, so a cluster that size would have drawn 352 kW.
The units being used here are likely different. In most press releases Nvidia uses their "tensor core" performance, usually with either sparsity or 16 bit data. A single A100 is said to have 320 teraflops of "tensor float" performance but only 19 teraflops of "normal" full FP32 performance.
This is way out of my field so I don't know the full implications, but my understanding is that NVIDIA cards can only reach these speeds at the loss of precision or full functionality, so it's an apples-to-oranges comparison versus non-NVIDIA chips.
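Roughly, from NVIDIA's published A100 numbers:

    FP32 (plain CUDA cores):       ~19.5 TFLOPS per GPU  ->  4x A100 ≈ 78 TFLOPS
    FP16 tensor cores (dense):     ~312 TFLOPS per GPU   ->  4x A100 ≈ 1.25 PFLOPS
    with 2:4 structured sparsity:  x2                    ->  ≈ 2.5 PFLOPS

So the DGX Station's "2.5 petaFLOPS" is the sparse tensor-core figure; on plain FP32 the comparison with the PS3 cluster's 500 TFLOPS is much closer to the 25-chip estimate upthread.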
I mean, the Cell SPU design is everywhere; it's just now part of every GPU and called a compute shader. Vector processors are a good idea, but they have to be accessible to programmers through standard, easy-to-use, ideally cross-platform APIs or they invariably won't get used to their full potential.
I wonder if that university still uses their PS3 based supercomputer cluster