This argument is less effective given that SIMD is not always just a stand-in for vector processing. Sometimes we want 128, 256, or 512 bits of processing as a unit and will follow it up with something different, not a repeated instance of that same process.
We also used SIMD quite extensively as a 'wider GPR' - not doing stuff over tons of input characters but instead using the superior size of SIMD registers to implement things like bitwise string and NFA matchers.
A SIMD instruction can be a reasonable proxy for a wide vector processor but the reverse is not true - a specialized vector architecture is unlikely to be very helpful for this kind of 'mixed' SIMD processing. Almost any "argument from DAXPY" fails for the much richer uses of SIMD processing among active practitioners using modern SIMD.
I went on a bit of a research expedition to see if there was something that scaled better than a general permutation instruction for SIMD machines. General permute scales at O(log(N)) in gate delay+ and O(N^2 * log(N)) in area, where N is the vector length. It's a full crossbar, but the fanout on the wires adds an additional log(N) factor in buffers.
For a while, it seemed like a set of instructions based on a rank-2 CLOS network (aka butterfly network) would get the job done. It scales at O(log(N)) in gate delay+ and O(N * log(N)) in area, and is very capable. Fanout is O(1). You can do all kinds of inter- and intra-lane swaps, rotations, and shifts with it. You can even do things like expand packed 16-bit RGB into 32-bit uniform RGBA.
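For intuition, a software emulation of one butterfly switch stage at bit granularity looks like this (a sketch only; real hardware would switch whole lanes rather than bits, and the per-stage masks m[] are the switch controls):

    #include <stdint.h>

    /* One butterfly stage: conditionally swap bits that are `dist` apart,
       controlled by mask m (the classic delta-swap trick). */
    static inline uint32_t delta_swap(uint32_t x, uint32_t m, unsigned dist) {
        uint32_t t = ((x >> dist) ^ x) & m;
        return x ^ t ^ (t << dist);
    }

    /* A full 32-wide butterfly permutation: log2(32) = 5 stages. */
    uint32_t butterfly_permute(uint32_t x, const uint32_t m[5]) {
        x = delta_swap(x, m[0], 16);
        x = delta_swap(x, m[1], 8);
        x = delta_swap(x, m[2], 4);
        x = delta_swap(x, m[3], 2);
        x = delta_swap(x, m[4], 1);
        return x;
    }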
But things like that SMH algorithm are definitely out of scope: each input bit can appear in at most one output location with the butterfly. So the cost to replicate a prefix scales at O(repetitions), which is unfortunate. Some algorithms based on general shuffle also rely on PSHUFB's ability to act as a complete lookup table, which the butterfly network can't do either.
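(For anyone who hasn't leaned on that trick: PSHUFB as a 16-entry lookup table looks roughly like this - the per-nibble popcount table is just one example use.)

    #include <tmmintrin.h>

    /* PSHUFB as a complete 16-entry LUT: each input byte (assumed 0..15)
       selects a byte from the table register. Here the table holds the
       popcount of each nibble value. */
    __m128i popcount_nibbles(__m128i nibbles) {
        const __m128i lut = _mm_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3,
                                          1, 2, 2, 3, 2, 3, 3, 4);
        return _mm_shuffle_epi8(lut, nibbles);
    }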
My conclusion was that you're basically stuck with a general permute instruction for a modern SIMD ISA, scaling be damned.
+ The latency scaling is somewhat misleading thanks to wire delay - they are both O(N) in wire delay.
I completely agree in principle (some applications are a good fit for traditional long vector processing, but it's a terrible fit for others).
However the RISC-V Vector Extensions basically let you use the processor in "SIMD mode" by setting the vector length to a small value. It will depend on the processor architecture whether a vector length of e.g. 4 is efficient or not, but I expect for many implementations it will be relatively efficient (and you can definitely just use it as a "wider GPR" if you want to).
The only catch at the ISA level is that it costs an instruction to change the vector size. So if you keep switching between e.g. 128-bit and 512-bit instructions at a very fine granularity, that might add overhead... I'm not sure that's a very common case though?
I agree. It seems like this strategy makes only the most braindead applications of SIMD better (simple loops that can be vectorized by an arbitrary factor), but doesn't really do anything to help the meatier SIMD workloads. Most SIMD code isn't as simple as this, and the code that is usually isn't a significant factor in either developer experience or runtime.
Seriously. A lot of these proposals go veering off into second-order considerations ("Easier to decode!" "A few picojoules less energy!") when I'd be very surprised if the bottlenecks come from SIMD-vs-vector ISA issues, as compared to, say, memory bandwidth or multiply-add bandwidth.
A few years ago I tried to buy a liquid cooled overclocked server for trading. Enabling AVX cost extra due to the concentrated heat output from the MMU and each core's vector unit.
It was along the lines of being able to get a server that was tested stable at 5GHz without AVX vs 4.5 GHz with AVX for the same price.
So at least on Intel, these vector units are apparently limiting clock speeds and yields due to power consumption.
Agreed, but then again, the vector architectures have the undeniable benefit of being agnostic to the underlying hardware.
With a SIMD architecture, as registers get ever wider you need to recompile your code, and binary releases require multiple code paths.
With a vector architecture, the hardware designers can just increase the machine's internal vector size and existing code will benefit immediately.
Furthermore, it seems reasonable to count GPU users among "active practitioners of modern SIMD". All code that is written in a GPU style -- which admittedly isn't too common yet on CPUs, but has certainly been feasible since ispc came along -- immediately becomes a beneficiary of a well-designed vector instruction architecture.
Yes, recompiling can be a pain, but there's an argument that we should get used to recompiling things from source and make sure this path is smooth, rather than trying to make old binaries work as long as possible. Static binaries are another reason.
You lose binary portability though: what if you could write your image processing code such that it operates well whether your SIMD instructions are 4, 8, or 16 lanes wide (or more)?
Then shipping this binary for mobile phones, for instance, would take advantage of the wider registers available on high-end devices while working well on more modest CPUs.
Advanced compute shaders very often need to synchronize local data within a warp. When you program such shaders you have to be aware of the SIMD width (32 on NVidia and Intel, 64 on AMD), and design both algorithms and data structures accordingly. Failing to do so often has significant performance costs; I saw up to a 10x speedup after implementing a cooperative algorithm instead of a straightforward one.
I have done a bit of SIMD programming in the past & I can tell you that it's quite common to expect to rewrite your SIMD-optimized code as a new architecture comes along. Sometimes you do it even as you go between different iterations of the same arch (like Cortex-A8 vs A9) because of different instruction timing (and sometimes bugs). In general, asking hardware to auto-optimize around your code doesn't work except for simple problems & even then you are likely leaving performance on the table.
What I really want is lots of different types of vector & permute ops and the ability to reconfigure the SIMD unit on a dime when I am done with a specific type of compute (like crypto) and switch to another type (like signal processing).
Well, that makes sense if you want to squeeze the last bit of performance out of highly optimized code.
However, there's a case to be made that, in order to utilize our hardware better, we should be using vector units much more often. To make that feasible, we need a good programming paradigm that doesn't have to be rewritten for a different architecture. If that ends up not utilizing the hardware perfectly, that's okay: using a 256-bit vector unit even at 50% of the potential performance is still many times faster than scalar code.
GPU Coders haven't changed their code in the last 10 years, even as NVidia changed their architecture repeatedly.
PTX Assembly from NVidia still runs on today's architectures. I think this variable-length issue they focus on so much is a bit of a red-herring: NVidia always was 32-way SIMD but the PTX Code remains portable nonetheless.
The power is that PTX Assembly (and AMD's GCN Assembly) has a scalar model of programming, but its execution is vectorized. So you write scalar code, but the programmer knows (and assumes) it runs in a parallel context. EDIT: I guess PTX is technically interpreted: the number of registers is not fixed, etc. etc. Nonetheless, the general "SIMD-ness" of PTX is static, and has survived a decade of hardware changes.
There are a few primitives needed for this to work: OpenCL's "Global Index" and "Local Index" for example. "Global Index" is where you are in the overall workstream, while "Local Index" is useful because intra-workgroup communications are VERY VERY FAST.
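To make that concrete, a do-nothing-special OpenCL C kernel touching both indices (the kernel and buffer names here are just illustrative):

    __kernel void scale(__global float *dst, __global const float *src, float k) {
        size_t gid = get_global_id(0);   /* position in the overall workstream */
        size_t lid = get_local_id(0);    /* position within the workgroup */
        (void)lid;                       /* fast intra-workgroup communication is keyed off lid */
        dst[gid] = k * src[gid];
    }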
And... that's about it? Really. I guess there are a bunch of primitives (the workgroup swizzle operations, "ballot", barrier, etc. etc.), but the general GPU model is actually kinda simple.
-----------
I see a lot of these CPU architecture changes, but none of them really seem to be trying to learn from NVidia or AMD's model. A bit of PTX-assembly or GCN Assembly probably would do good to the next generation of CPU Architects.
I'm really not sure something like PTX is the answer.
PTX isn't really native; it's JIT-compiled into native code by the GPU driver. In many high-performance apps like games, natively compiled versions are included for several architectures.
The GPU programming model is simple because it's limited: low single-threaded performance, high latency masked by threads switching, inefficient branching, limited memory model (can't allocate RAM, write access is very limited).
If you're happy with these limitations, write OpenCL code and run it on the CPU. It will work much faster than scalar code but likely slower than a GPU would.
> Low single-threaded performance, high latency masked by threads switching,
Those are fundamental design tradeoffs of the GPU architecture. Not of PTX Assembly language.
> inefficient branching
Ehh? I disagree strongly. GPUs branch very efficiently, considering it's within a SIMD environment. Modern GPUs can arbitrarily diverge execution paths and reconverge as necessary. It's far more advanced than any SIMD I've seen implemented on a CPU.
GPUs have the most efficient way to branch when you're managing 32 or 64 work items at a time.
> limited memory model (can't allocate RAM, write access is very limited).
You can definitely allocate RAM on a GPU. That's what the CudaMalloc function does. What do you mean by "limited write access" ??
---------
Programmers don't want to "think" in SIMD. They want to think in terms of scalar code. Per-lane Gather/Scatter is far easier to think about than vpshufb (even though vpshufb is effectively a gather instruction over an AVX register)
That's the difference. GPU Assembly language is almost equivalent to CPU stuff; it's just "seen" in a different light.
CPUs are missing very, very few assembly instructions before they can run like a GPU. CPUs simply need a hardware accelerated branching-mechanism to help handle the divergent flow graphs (a simple advancement over bitmasks... AMD GCN is a good example. NVidia's Volta/Turing fully divergent flow would be nicer, but that requires a program-counter per SIMD-lane).
AMD's GCN architecture implements that mostly in just two assembly instructions: S_CBRANCH_FORK and S_CBRANCH_JOIN.
Intel is missing the "scatter-equivalent" of vpshufb. GPUs can gather/scatter arbitrarily between SIMD lanes, but Intel AVX512 only has "SIMD-gather" in the "vpshufb" instruction. Add "SIMD-scatter" and AVX512 would be complete. AMD GCN calls the two instructions "permute" and "bpermute" (backwards permute). So maybe a "backwards-vpshufb" or "vpbshufb" instruction is all AVX512 needs.
Finally, allow SIMD-instructions on CPUs to pipeline in more complex manners, to increase utilization. Only guarantee that they complete through the use of a barrier instruction. This one is an architectural shift, but GPUs are aware of "larger" workgroup sizes. (AMD GCN natively executes 64-at-a-time. But it can grow thanks to the barrier-mechanism).
The "variable length" discussed in the article here is backwards. GPUs grow their workgroups bigger, and then the programmer creates synchronization points with barrier instructions. It's actually a very simple model to work with.
Bam. Those 5 or 6 assembly instructions would make CPU-programming effectively on the same level as GPU programming.
> Those are fundamental design tradeoffs of the GPU architecture. Not of PTX Assembly language.
PTX is not general purpose, it was specifically designed for that architecture, and it incorporates these tradeoffs.
> far more advanced than any SIMD I've seen implemented on a CPU.
Less advanced than normal non-SIMD branching available to CPU cores. A lot of practical algorithms need both SIMD compute and scalar branching.
> That's what the CudaMalloc function does.
The CudaMalloc function can't be called from GPU code. You can't allocate RAM at all from inside your algorithm; there's no stack, no heap, nothing. You have to know in advance how much RAM you need. For some practical problems this is a showstopper, e.g. try to implement unzip in CUDA.
> What do you mean by "limited write access"?
You only have fixed-length arrays/textures. Global memory barriers are very slow. There are producer-consumer buffers, but practically speaking they're merely thread-safe cursors over statically sized buffers.
GPUs can multiply dense matrices very fast, but many practical problems are different; e.g. for sparse matrices GPUs don't deliver value performance-wise. SIMD on the CPU is often very useful for such problems, but yes, the programming model is different, lower level, and more complex. No free lunches.
> Programmers don't want to "think" in SIMD.
I'm a programmer and I like SIMD.
> even though vpshufb is effectively a gather instruction over an AVX register
On a GPU it would be, because you don't care about latency - you just spawn more threads.
On a CPU it's not, because shuffle_epi8 is a 1-cycle-latency instruction and RAM access is much slower; if you think they're equivalent you'll miss the performance difference.
> CPUs are missing very, very few assembly instructions before they can run like a GPU
Even if you add these few instructions, GPUs will still be much faster, by orders of magnitude. The hardware is too different, and it's not the instructions: the CPU spends transistors minimizing latency (caches and their synchronization, branch prediction, speculative execution, etc.); GPUs don't care about latency.
It's not the instruction set that allowed the simple programming model on GPUs. It's the fundamentally different tradeoffs in hardware.
I think I overcomplicated my previous post. Lemme cut back the cruft and simplify. I think CPUs (Specifically AVX512) should implement the following instructions:
1. Barriers and Workgroups -- Scale SIMD UP, not downwards. The variable-length vector (discussed in this article) is backwards compared to the current programming model. GPU programmers combine lanes with the concept of an OpenCL workgroup or CUDA thread block, and it works pretty well in my experience.
2. Implement AMD GCN-style branching with S_CBRANCH_FORK and S_CBRANCH_JOIN. This will accelerate branching when SIMD-lanes diverge in execution paths.
3. Implement "backwards vpshufb". GPUs can gather or scatter values between lanes, while CPUs can only gather data between lanes (with vpshufb). Intel AVX512 is missing an obvious and very important instruction for high-speed communication between SIMD lanes.
I agree these changes would be nice. OpenCL and similar would probably work faster on CPUs with these instructions.
I’m just not sure it’s worth it. People who are OK with the GPU programming model are already using GPUs because they're way more powerful. The AVX-512 theoretical max is 64 FLOPs/cycle; a modern $600 CPU (i7-7820X) with good enough cooling is capable of 1.8 TFLOPS single precision. A generation-old $600 GPU (1080 Ti) is capable of 10.6 TFLOPS. Huge difference.
> GPUs can gather or scatter values between lanes
Can they? AFAIK they can only permute values between lanes but not scatter/gather.
__shfl_sync() in CUDA does exactly the same as _mm256_permutevar8x32_ps() in AVX2 or _mm512_permutexvar_ps in AVX512.
> Can they? AFAIK they can only permute values between lanes but not scatter/gather.
GPUs have a crossbar between OpenCL __local memory (CUDA's shared memory) and every work-item in a workgroup. It's so innate to the GPU that it's almost implicit. In terms of OpenCL, the code looks something like this:
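A rough sketch (a 64-lane workgroup is assumed, and dst_slot is a hypothetical buffer holding each lane's destination index within its workgroup):

    __kernel void backward_shuffle(__global float *out,
                                   __global const float *in,
                                   __global const int *dst_slot) {
        __local float tmp[64];          /* one slot per lane in the workgroup */
        size_t lid = get_local_id(0);
        size_t gid = get_global_id(0);
        tmp[dst_slot[gid]] = in[gid];   /* scatter: each lane chooses where its value lands */
        barrier(CLK_LOCAL_MEM_FENCE);
        out[gid] = tmp[lid];            /* each lane reads its own slot back */
    }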
The above is the equivalent of a "backwards vpshufb". AVX512 is missing this equivalent.
GPUs don't really "need" a dedicated instruction to do this, because __local memory is a full-speed crossbar when workgroups are of size 32 (NVidia) or 64 (AMD), the native SIMD size. Its performance characteristics are not quite equivalent to vpshufb, but it's still very, very, very fast in practice.
> __shfl_sync() in CUDA does exactly the same as _mm256_permutevar8x32_ps() in AVX2 or _mm512_permutexvar_ps in AVX512.
There's little reason to use __shfl_sync(), because it goes through CUDA Shared memory anyway. Ditto with AMD's __amdgcn_ds_permute() and __amdgcn_ds_bpermute() intrinsics.
EDIT: I guess __shfl_sync() and the __amdgcn_ds_permute / bpermute intrinsics save a step. They're smaller assembly language and more concise. But I expect the overall performance to not be much different from using LDS / Shared Memory explicitly.
------------------
> I’m just not sure it’s worth it. People who are OK with the GPU programming model are already using GPUs because they're way more powerful. The AVX-512 theoretical max is 64 FLOPs/cycle; a modern $600 CPU (i7-7820X) with good enough cooling is capable of 1.8 TFLOPS single precision. A generation-old $600 GPU (1080 Ti) is capable of 10.6 TFLOPS. Huge difference.
People don't program CPUs for FLOPs. They program on CPUs for minimum latency.
SIMD Compute is useful on a CPU because it stays in L1 cache. L1 cache is 64kB, more than enough to have a good SIMD processor accelerate some movement. CPUs even have full bandwidth to L2 cache, which is huge these days (512kB on EPYC to 1MB on Skylake-server)
CPU-based SIMD won't ever be as big or as broad as GPU-based SIMD. But... CPU-based SIMD should become easier as Intel figures out how to adopt OpenCL or CUDA programming paradigms.
There are already many problems implemented in CPU-AVX512 which execute faster than the 15.8 GB/s that a PCIe x16 bus will give you. Therefore, it's more efficient to execute the whole problem on the CPU, rather than transfer the data to the GPU.
The data exchange is performed between registers, and is more efficient than going through shared memory, which requires a load, a store and an extra register to hold the address.
If you count shared memory scatter/gather, CPU SIMD already has both. Scatter only appeared very recently, in AVX512. Gather has been available for 5 years now; _mm256_i32gather_ps was introduced in AVX2, albeit it's not particularly fast.
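For reference, the AVX2 gather in intrinsics form is a one-liner (a sketch; base and idx are whatever array and index vector you happen to have):

    #include <immintrin.h>

    /* Gather 8 floats from base[idx[0..7]] in one instruction (vgatherdps). */
    __m256 gather8(const float *base, __m256i idx) {
        return _mm256_i32gather_ps(base, idx, 4);   /* scale = sizeof(float) */
    }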
> They program on CPUs for minimum latency.
Not just that. I code for CPU SIMD very often, and only occasionally for GPGPU, even for code that would work very well on GPUs. The main reason for me is compatibility. I mostly work on desktop software; picking CUDA decreases the userbase by a factor of 2, which is often not an option. But yeah, another reason is that CPU SIMD is fast enough already and spending time on PCIe IO doesn't pay off.
Update: another reason why I don't code GPGPU more is the different programming model. The GPU programming model makes writing device code easy, and like you mentioned earlier it even has good scalability built in, i.e. in many cases compute shaders need not be aware of the warp size.
But the downside is upfront engineering costs.
I have to keep my data in a very small number of contiguous buffers. I have to upload these buffers to the GPU. I have to know in advance how much VRAM I need for the output data.
I find this part much easier for CPU SIMD: on the CPU I only need to design the lowest level of my data structures accordingly, but I can use anything at all on the higher levels of the structures - hash maps, trees, linked graphs - they all work just fine, as long as their lower-level nodes are not too small, are aligned, dense, and composed of these 128/256-bit SIMD vectors.
> If you count shared memory scatter/gather, CPU SIMD already has both. Scatter only appeared very recently, in AVX512. Gather has been available for 5 years now; _mm256_i32gather_ps was introduced in AVX2, albeit it's not particularly fast.
You can say that again. It's still not very fast, btw.
VPSCATTERDD ZMM-register is measured to be 17-clock cycles (!!!), while VPGATHERDD ZMM-register is 9-clock cycles. Gather/scatter on CPUs is very, very slow!
It's actually faster to gather/scatter through a for-loop than to use the VPGATHERDD or VPSCATTERDD instructions.
In contrast, Shared Memory on GPUs is a full crossbar on NVidia and AMD.
-------------
I think my details are getting a bit AMD specific. Lemme do a citation:
> As a previous post briefly described, GCN3 includes two new instructions: ds_permute_b32 and ds_bpermute_b32 . They use LDS hardware to route data between the 64 lanes of a wavefront, but they don’t actually write to an LDS location.
The important tidbit is that the Load/Store units of AMD's GPU Core can support a gather/scatter to separate LDS memory banks at virtually no cost. This is a full crossbar that allows GPUs to swap lanes between SIMD registers in different lanes.
Intel CPUs do NOT have this feature, outside of vpshufb. I'm arguing that GPU Shared memory is of similar performance to vpshufb (a little bit slower, but still way faster than even a CPU's Gather/Scatter).
So yes, the "bpermute" and "permute" instructions on AMD are a full-crossbar and can execute within 2-clock cycles (if there are no bank conflicts). That's 64-dwords that can be shuffled around in just 2-clock cycles.
In contrast, Intel's Gather/scatter to L1 cache is 9-clocks or 17-clocks respectively.
-----------
The important thing here is to do a high-speed permute. The programmer can choose to use vpgatherdd, vpshufb, GPU permute, GPU bpermute, GPU LDS Memory, etc. etc. It doesn't matter for program correctness: it all does the same thing.
But GPUs have the highest-performance shuffle and permute operators. Even if you go through LDS Memory. In fact, the general permute operators of AMD GPUs just go through LDS memory, that's their underlying implementation!
> while VPGATHERDD ZMM-register is 9-clock cycles.
Quite expected because RAM in general is much slower than registers. Even L1 cache on CPU / groupshared on GPU. On both CPUs and GPUs, you want to do as much work as possible with the data in registers.
The reason why it’s OK on the GPU is massive parallelism hiding latency, i.e. the hardware computes other threads instead of just waiting for data to arrive. But even on CPUs, if the requested data is not too far away (in L1 or L2), hyperthreading in most modern CPUs does an acceptable job in this situation.
> Intel CPUs do NOT have this feature, outside of vpshufb
Yes they have. If you prefer assembly, look at the vpermps instruction. It can permute lanes arbitrarily, even across 128-bit lanes (this is unlike vpshufb/vshufps/etc.), with permutation indices being taken from another vector register. Quite fast: specifically, on Haswell/Broadwell/Skylake it’s 3 cycles latency, 1 cycle throughput, single micro-op.
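In intrinsics form, that cross-lane "gather within a register" is a one-liner (idx holds 8 lane indices in 0..7):

    #include <immintrin.h>

    /* result[i] = v[idx[i] & 7] -- arbitrary cross-lane permute (vpermps). */
    __m256 gather_lanes(__m256 v, __m256i idx) {
        return _mm256_permutevar8x32_ps(v, idx);
    }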
Stream Compaction can be used to remove redundant whitespace from strings, or to "compress" the raytracer rays so that they are all able to be read by a simple load (as opposed to a gather/scatter operation). How do you perform stream compaction?
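On a GPU the standard recipe is a work-group prefix sum over the keep-flags followed by a scatter into local memory. A rough OpenCL C sketch (it assumes the OpenCL 2.0 work_group_scan/reduce built-ins and a 64-lane workgroup, and it packs output per workgroup rather than globally):

    __kernel void compact64(__global float *out, __global const float *in,
                            __global const int *keep) {
        __local float tmp[64];                           /* one slot per lane */
        size_t lid = get_local_id(0);
        size_t gid = get_global_id(0);
        int flag = keep[gid] ? 1 : 0;
        int slot = work_group_scan_exclusive_add(flag);  /* this lane's output position */
        if (flag)
            tmp[slot] = in[gid];                         /* scatter survivors to the front */
        barrier(CLK_LOCAL_MEM_FENCE);
        int total = work_group_reduce_add(flag);         /* how many survived */
        if (lid < (size_t)total)
            out[get_group_id(0) * 64 + lid] = tmp[lid];  /* packed chunk per workgroup */
    }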
Now, how do you do this using vpshufb? You can't. vpshufb is backwards, and won't efficiently solve this stream compaction problem. GPUs can solve this problem very efficiently, but Intel's current implementation of AVX512 is missing the "backwards" vpshufb command, to perform this operation.
Or as I've been trying to say: vpshufb is equivalent to a "gather" over SIMD Registers. But Intel is MISSING a scatter over SIMD Registers. The instruction is just... not there. I've looked for it, and it doesn't exist. As such, GPU-code (such as the stream compaction algorithm) CANNOT be implemented efficiently on a CPU.
I mean, you can use vpscatterdd to implement it, but again... vpscatterdd is 17-clock cycles. That's way too slow.
> Quite expected because RAM in general is much slower than registers. Even L1 cache on CPU / groupshared on GPU. On both CPUs and GPUs, you want to do as much work as possible with the data in registers.
The L1 cache has the bandwidth to do it; it just isn't wired up the right way for this mechanism. Intel's load/store units can read/write 32 bytes at a time, in parallel across 2x load units + 1x store unit.
But the thing is: writing many small "chunks" of 4-bytes here and there (ie: a vpgatherdd / vpscatterdd operation) is not a contiguous group of 32-bytes. Therefore, Intel's cores lose a LOT of bandwidth in this case.
GPUs, on the other hand, have a great many load/store units - effectively one per SIMD unit. As such, reading/writing to LDS "shared" memory on AMD GPUs can be done 32 at a time.
So the equivalent "vpgatherdd" over LDS cache will execute in something like 2-clock ticks on AMD GPUs (assuming no bank conflicts), while it'd take 9-clock ticks on Intel cores.
Again, LDS cache is so fast on GPUs that it effectively functions as if any LDS load/store were as fast as a vpshufb instruction. (Not quite: vpshufb is 1 clock tick, and doesn't have to worry about bank conflicts. So vpshufb is still faster... but GPUs' gather/scatter capabilities are downright incredible.)
How long do you think it will be before Intel implements a true crossbar so that the vpgatherdd and vpscatterdd instructions can actually execute quickly on a CPU?
-----------
GPUs actually implement a richer language for data-movement. Intel could very easily fix this problem by writing a "backwards vpshufb" instruction, but I'm not aware of anything that exists like that... or any plans to implement something like that.
> Now, how do you do this using vpshufb? You can't.
You keep mentioning vpshufb even though it's unable to move data across 128-bit lanes.
Here's how to do that with vpermps; there's some overhead, but not much, and it's very likely much faster than RAM access even when the data is in L1: https://stackoverflow.com/a/36951611/126995
Besides, new CPUs have AVX-512, which has the vcompressps instruction just for that use case.
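A sketch of that (AVX-512F assumed; keep is a lane mask you'd typically get from a compare):

    #include <immintrin.h>

    /* Left-pack the lanes whose mask bit is set (vcompressps); the rest are zeroed. */
    __m512 compact_lanes(__m512 v, __mmask16 keep) {
        return _mm512_maskz_compress_ps(keep, v);
    }

    /* Or stream the surviving lanes straight to memory, unaligned. */
    void compact_store(float *dst, __m512 v, __mmask16 keep) {
        _mm512_mask_compressstoreu_ps(dst, keep, v);
    }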
> The L1 cache has the bandwidth to do it
It has bandwidth, but there's extra latency involved. Registers are faster.
It's the same on the GPU; that's why nVidia recommends these CUDA lane-permute intrinsics over scatter/gather RAM access.
> that doesn't have to be rewritten for a different architecture
CPU architectures are quite stable. SSE2 is almost 20 years old now. You can't even run modern Windows on a system which doesn't support it.
Vectorize to SSE and you'll get your 50% of the potential performance. You can do it without any new paradigms; C and C++ have supported SSE intrinsics for decades already, and other languages are catching up.
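For anyone who hasn't touched them, the intrinsics route looks like this minimal sketch (n is assumed to be a multiple of 4; unaligned loads/stores keep it simple):

    #include <stddef.h>
    #include <xmmintrin.h>

    /* dst[i] = a[i] + b[i], four floats per iteration (plain SSE). */
    void add_f32(float *dst, const float *a, const float *b, size_t n) {
        for (size_t i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
        }
    }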
I thought it was obvious, but apparently I wasn't clear enough: what I meant by the need for a rewrite is the need to rewrite in order to benefit from architectural changes.
If you wrote SSE2 code 20 years ago, then yes, it's still going to run. But it's not going to benefit from the advances made with AVX and AVX2.
Compare this to the GPU model, or ispc or vector instructions, where you automatically benefit from a wider machine without a rewrite of your code (and in the case of vector instructions even without a recompile).
If you want to do an analogy with GPU compute shaders, I think it is more accurate to compare to how GPUs can scale the number of cores without (potentially) the need to recompile, as long as enough blocks are scheduled.
This is orthogonal to the fact that these are warp-size aware I believe.
Sounds like you're getting fixated on the example. Short vectors would do what you want. And it's a lot easier to add specific vector instructions for particular use cases than to add N of them for each use case (for N SIMD sizes).
Are you getting hung up on the term 'vector'? That doesn't assume you're just doing linear algebra.
No, he's not getting hung up on the term vector. He is saying that SIMD is fixed (if ever wider) width, and that is not equivalent to arbitrary width. For example, SIMD shuffle-type instructions have no arbitrary-length equivalent.
How would you make an arbitrary shuffle between arbitrary-length vectors faster than an arbitrary gather/scatter to and from L1 cache?
It's possible to have a chip hard-code specialized transforms or have a many-to-many crossbar when N is small (e.g. 16), but when N is large (e.g. 1024 elements), it's no longer easy to see how to build a high-speed permute operator.
-----------
That's the thing. People only use vpshufb because it's WAY faster than L1 cache. If Intel made a faster gather/scatter, there wouldn't be much point to the vpshufb instruction. But the vpshufb instruction is only so fast because it's so specialized and small: it only has to worry about a 16-byte permute.
In short: we ALREADY have an arbitrary permute instruction. It's called gather/scatter. That's not what programmers want, however. (I mean, programmers want a faster gather/scatter... but... vpshufb programmers use that operator only because it's faster than L1 cache.)
There's a front-end operating cost to the bookkeeping instructions, and a complexity cost to all the variants of the same instructions. For short vectors, the CPU can use the same hardware it uses today for SIMD, just that the SIMD work is in microcode instead of asm. The cost of fetching the permutation vector arg out of L1 isn't terribly high compared to the cost of fetch/decode on the bookkeeping instructions. And the cost of supporting all those instruction variants could be replaced with more functionality in the front-end.
Seriously. Permutes get harder as you scale - VBMI on CNL is an indicator that 64-way is pretty good but it's still considerably more expensive than 4 16-way permutes on the same architecture.
There's a reason that gather is hard to do; I think if you rocked up and asked the architecture guys for a gather that was competitive with small-scale permute they would reply with the time-honored Intel putdown ("You are overpaid for whatever it is you do").
Hey, you can have 7-bit lookup tables at the byte level on AVX512VBMI (using the 2-register shuffle forms) and you can already have 6-bit lookups with 2-register 16-bit shuffles if you can play around on Skylake Server.
Mass availability of the VBMI goodies looks to be bottlenecked behind Icelake/Sunny Cove, so you'll have plenty of time to think through the implications of fast 6-bit lookup. :-)
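A sketch of that 7-bit byte lookup with the 2-register shuffle (AVX512VBMI assumed; lut_lo and lut_hi together hold the 128-byte table):

    #include <immintrin.h>

    /* 128-entry byte LUT: each index byte uses its low 7 bits to pick a byte
       from the concatenation of lut_lo (entries 0..63) and lut_hi (64..127). */
    __m512i lookup7bit(__m512i idx, __m512i lut_lo, __m512i lut_hi) {
        return _mm512_permutex2var_epi8(lut_lo, idx, lut_hi);   /* vpermi2b/vpermt2b */
    }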
Yeah, other examples include image codecs, such as JPEG: the DCT performed on the 8x8 blocks can benefit from SIMD, but the lanes aren't independent at all (matrix transposes, various intra-block additions).
I believe the blocks end up stored as an array-of-structs where the structs have 8*8 = 64 elements. Doing the DCT in multiple blocks requires somehow transposing this into a struct-of-arrays-like format, maybe a gather of every 64th element (likely a waste of memory bandwidth) or some sort of unpckl/unpckh-like instructions. Either way, this may impose non-trivial overhead, and so the benefits of extra parallelism are hidden.
(And, of course, that's all assuming there are enough registers, and I don't remember enough about JPEG to make a guess.)
Only if these multiple blocks fit in the 16 registers. If they don't fit and the data gets evicted to RAM, that extra parallelism will slow down the code, not speed it up.
In a vector arch like the RISC-V vector extension, if you want to process exactly 256 bits or whatever at a time, you just set the configuration or vector length register and off you go with SIMD-style programming?
It is very unlikely that such a configuration will perform remotely as fast as a native SIMD implementation, unless there is some truly heroic specialization going on under the hood. Obviously the work is still possible with vector ops, in the same way that it's still possible with scalar ops too. But will it be fast? My guess is no.
> It is very unlikely that such a configuration will perform remotely as fast as a native SIMD implementation
Why? AFAICS, a "real" vector ISA is mostly a superset of a SIMD ISA. What do you think is missing?
Of course, if you want to eke out the absolute maximum performance then you need to be aware of the microarchitecture you're targeting, and a length-agnostic vector ISA like SVE or RVV doesn't buy you that much. But I don't see how that's worse than having to redo your code whenever the vendor introduces new HW with a new SIMD ISA.
I guess one could argue that an implementation of a vector ISA targeting, say, linear algebra would be different from an implementation focusing on maximum short-vector performance - say, by having a vector length >> execution width, and using tricks like vector pipelining etc. to get performance rather than focusing on minimizing short-vector latency. But that's a question of what the implementation is optimized for rather than of what the ISA is good for, no?
In some recent work from my group [1], we reduce the complexity of keeping up with new SIMD ISAs by retargeting code between generations. For example, a compiler pass can take code written to target SSE2 (with intrinsics) and emit AVX-512 - it auto-vectorizes hand-vectorized code. With a more capable compiler, if the ISA grows in complexity, programmers and users of libraries get speedups without rewriting their code or relying on scalar auto-vectorization. However, the x86 ISA growth certainly pushed some complexity on us as compiler writers - we had to write a pass to retarget instructions!
Recently a patch was contributed to gcc that converts MMX intrinsics to SSE. Also, the gcc POWER target supports x86 vector intrinsics, converting them to the POWER equivalents.
It's not as ambitious as your approach though, more like a 1:1 translation and thus cannot take advantage of wider vectors.
That patch is primarily there to avoid the pitfalls of MMX on modern architectures; MMX is gradually becoming deprecated. On SKX, operations that are available on both ports 0 and 1 for SSE or AVX are only available on port 0 for MMX. So code that uses MMX gets half the throughput (which may or may not matter, but still).
Thanks for the explanation, I wasn't aware of the reasoning behind it. I would guess by now all actively maintained performance-critical code has been rewritten in something more modern, so it certainly makes sense for Intel to minimize the number of gates they dedicate to MMX.
There is probably a lot of merit in the advantages of vectors, but it weakens the article to set them up against SIMD when the presented facts are dubious at best:
> An architect partitions the existing 64-bit registers
> The IA-32 instruction set has grown from 80 to around 1400 instructions since 1978, largely fueled by SIMD.
Wait, what. IA-32 started in 1985, not 1978. It didn't have any existing 64-bit registers. It was called IA-32 because of the 32-bit registers, like EAX and EBX. And then, looking at the 1986 reference manual https://css.csail.mit.edu/6.858/2014/readings/i386.pdf I count 96 instructions under 17.2.2.11. The IA-32 instruction set didn't grow much all these years; IA-64 did, to the best of my knowledge, but please let me know if I am wrong here. As for IA-64, I looked at https://www.intel.com/content/dam/www/public/us/en/documents... and it's hard to get an accurate count because some instructions are grouped together; it's either 627 or 996 (and I may have made a counting mistake given I started from a PDF, but it should be close), which is indeed very high, but even our best attempt only finds a tenfold growth (and perhaps only 6.5x) instead of the 17.5x the article suggested.
Roll my eyes every time I see a "Considered Harmful" headline.
As for SIMD, it's a huge benefit when used in the right context. I applied it to image processing and video compression algorithms in the past, with significant performance gains.
> Roll my eyes every time I see a "Considered Harmful" headline.
Me too.
> As for SIMD, it's a huge benefit when used in the right context.
Sure, but as the article points out, that benefit could be even huger when used on a "proper" vector architecture with veeeery wide vector registers that do not also double as not-very-wide scalar registers. "The SIMD instructions execute 10 to 20 times more instructions than RV32V because each SIMD loop does only 2 or 4 elements instead of 64 in the vector case."
I think adding SIMD instructions to x86 was a good trade-off at the time, but I also think the authors are correct that new ISAs designed now are better off with a vector architecture like they propose. In the end it's apples vs. oranges because the two contexts are not comparable.
I feel like even though full vector architecture might perform a lot better, the use cases may be much narrower than SIMD, especially on typical desktop or server (web applications etc, not scientific computing, deep learning or image processing -- many of which are already vectorised on GPUs) workloads. As others have mentioned, SIMD allows you to do a little bit of vectorisation in an otherwise non-vector workload, or use it for the wider registers or whatever. I don't know enough about it personally to be able to judge either way, though. I just know that I've attempted to vectorise some hobby game code for fun a few times and typically found it much harder to achieve than it first seemed, even though the data seemed trivially vectorisable at first. Perhaps that's just lack of experience.
"Harmful" doesn't mean "so dead in the water that it doesn't actually have performance gains". Intel wouldn't have integrated SIMD if it didn't work at all; that's not the criticism. Rather, the claim is that the performance gains are not well justified by the technical debt they bring to the architecture. The performance gains are not as good as they could be with the vector approach, which has better dynamic and static code density, doesn't horribly proliferate the instruction set, and allows for tuning without recompilation of code to use different instructions.
For those who don’t get the title reference: ”[Programming Pattern] Considered Harmful” is a title meme that began in 1968 — over fifty years ago — and still going strong today!
It all started with the famous Edsger Dijkstra paper titled “Go To Statement Considered Harmful” [1], which led to an endless series of subsequent papers later patterned after its title [2].
I agree with the sentiment that this title pattern is overused currently, to the point of being cliche — with perhaps a bit of presumptuousness as well, due to the implicit suggestion that the author’s claim of “Considered Harmful” will withstand the test of time as well as Dijkstra’s paper (though perhaps I’m reading too much into it, in this case).
> Roll my eyes every time I see a "Considered Harmful" headline.
Same. Please, when you write a title for your article, be specific about what you think is wrong. "X Considered Harmful" is lazy writing and devalues any substantive argument you make because it's become a cliche to the point where it's almost anti-clickbait.
GUILTY ADMISSION: a related trap I started falling into when writing content for work (email, documents, whatever) and I was in a hurry for a subject line or title was "Thoughts on X". That's right, "thoughts": I communed with the creator and distilled them from on high. Yeah, ain't nobody got time to read that.
This is one of the few times when a Considered Harmful article deserves the title. Did you see the RISC-V assembly code? Holy shit! It's so much nicer than every other SIMD ISA I've ever used! So much easier to program, so much easier to write compilers for -- and it should also be easier for CPU designers to handle efficiently. What's not to like?
> should also be easier for CPU designers to handle efficiently
I'm always dubious of claims like this. I think in general the architecture would need to turn it back into the widest SIMD form it supports at instruction decode.
So, if I understand it correctly, the text argues in favor of the GPU approach of pipelining independent vector operations instead of the current SIMD approach.
I see how this could be beneficial, especially when writing code, as it's way closer to just a normal loop.
Then again, why not combine both ideas and pipeline chunks of SIMD type? Say we have 4 execution stages and 32bit SIMD types (unrealistic, I know) and want to process 8-bit numbers. Wouldn't we be able to process 16 of them at the same time? Actually, isn't that kind of what GPUs already do?
I'm sure smarter people than I have reasoned about this, maybe someone can link a good article. I only know of this one [1] and one about GPGPU that I just can't find any more (but which was also very interesting)
By coincidence I started a new blog a few days ago and my first article is about the SIMD Instructions Considered Harmful post from a power efficiency perspective... maybe I should post it separately on HN? :)
I think I'm kinda explaining how it's similar (and different) to what modern GPUs do but I'm not sure I understand what you mean by "wouldn't we be able to process 16 of them at the same time" - do you mean a throughput of 16/clock, or just that 16 are "in flight" through the pipeline with a throughput of 4/clock?
I'm not sure I'm clear enough about it for those without a GPU HW background. If it's not clear I'm happy to write down a more detailed explanation here!
> Wouldn't we be able to process 16 of them at the same time?
Yes you would if you had 4 execution ports available and no data dependencies. Of course, those execution ports could also be processing 256 bit wide SIMD registers instead of just 32 bits. So it's a bad idea.
Instruction count is also higher, which is never a good thing.
> Actually, isn't that kind of what GPUs already do?
Maybe CPU architectures should just have data-parallel loop support of arbitrary width. The CPU can implement it in microcode however it feels like, or perhaps a kernel can trap it and send it off to a GPU transparently.
Strikes me as much cleaner design-wise than stuff like CUDA or OpenCL or the SIMD of today.
This sounds quite optimistic. How would microcode deal with allocating registers, or nested data parallelism? You are describing transformations that usually happen fairly early in compiler optimization pipelines, and pushing that down to microcode would bring huge complexity.
IIRC the Mill CPU handles this by performing a translation at install time.
For Mill CPU variants with wide vector units the CPU could execute certain instructions in one go, while for variants with narrow units it might have to issue multiple instructions.
Their idea is to handle this by basically doing ahead of time compilation of a generic program image, turning it into a specialized version for the installed CPU.
This sounds like the claims that Intel made for the Itanium and its EPIC instruction set when Itanium did not yet exist. The rest is history.
All of the following quotes taken from
> Their idea is to handle this by basically doing ahead of time compilation of a generic program image, turning it into a specialized version for the installed CPU.
"EPIC implements a form of very long instruction word (VLIW) architecture, in which a single instruction word contains multiple instructions. With EPIC, the compiler determines in advance which instructions can be executed at the same time, so the microprocessor simply executes the instructions and does not need elaborate mechanisms to determine which instructions to execute in parallel."
The problem with all the approaches that depend on AOT compilation is that no such "magic" compiler exists. And no, machine learning or AI is not the solution. ;-)
As I understood it, and as far as I can remember, the Mill AOT compiler has an easier job than that. The generic image already contains the parallelized instructions; the AOT just has to split those that are too wide for the given CPU.
Been a while since I saw the AOT talk tho. And as mentioned, so far it's all talk anyway.
> As I understood it, and as far as I can remember, the Mill AOT compiler has an easier job than that. The generic image already contains the parallelized instructions; the AOT just has to split those that are too wide for the given CPU.
In my opinion, this just moves the problem to a meta level. For the EPIC instructions of Itanium, one could encode multiple (parallel) instructions into one VLIW instruction. It was a huge problem to parallelize existing, say, C or C++ code so that this capability could be used. The fact that such a "smart compiler" turned out to be so hard to write was one of the things that broke Itanium's neck.
I openly have no idea by what magic a "sufficiently smart compiler" that can create such a "generic image [that] already contains the parallelized instructions" suddenly appears. How is it possible that compilers can suddenly parallelize the program, which turned out to be nigh impossible for the Itanium?!
It's been too long since I watched the videos, so unfortunately I don't remember the specifics. For reference, here's[1] the relevant one on the compiler aspect.
I do seem to recall that they seemingly had studied the failures of Itanium, and supposedly designed their architecture to not fall into the same pitfalls as with the EPIC/Itanium.
One aspect I recall is that while they have VLIW, different operations within the (compound) instruction are issued in such a way that lets them be interdependent. Like, a single VLIW instruction could have an add and a multiply, where the result of the add is used as input for the multiplication. So while the operations are grouped in a single instruction, they're not executed strictly in parallel. There's a lot of other aspects too, that's just the one I remember.
But yeah, really curious to know how that pudding will turn out.
The Mill also doesn't exist in a usable form, nor have I heard anything from them in a couple of years, so, as you say, the proof is in the pudding. I hope it sees the light of day and meets their claims, but we will have to wait and see.
Modern x86 processors already do a lot of register renaming and speculative and out of order execution. Much of the huge complexity you're worried about already exists in modern CPUs in order to track and eliminate false data dependencies and to keep the CPU busy in the face of data hazards.
I'm well aware of the basics of ooo. What GP was proposing involves re-vectorizing a virtual ISA according to hardware width. I'm sure micro-op fusion and register renaming could theoretically be extended to do that. It's the complexity I was alluding to.
It's an entirely different approach than what the RISC-V folks are pushing for. It's great that this guy is working with them on the vector instructions, but I'm afraid it's too soon to claim a "right" way to go.
It's also not fair to compare instructions executed between SIMD and some huge vector register implementation. Most common RISC-V CPUs are likely to have smaller vector registers, from 256 to 512 bits wide.
> And yet the current RISC-V approach is not as good as MXP
I watched that presentation a while ago, and while the figures that are shown look nice, I suspect the crux is whether MXP is practically implementable. I'm not at all an expert on this topic, so take this with a large grain of salt. Anyway:
1) With MXP instead of a vector register file you have a scratchpad memory, i.e. a chunk of byte-addressable memory in the CPU. Now, if you want multiple vector ALU's (lanes), that scratchpad then needs to be multi-ported, which quickly starts to eat up a lot of area and power. In contrast, a vector regfile can be split into single-ported per lane chunks, saving power and area.
2) MXP seems to be dependent on these shuffling engines to align the data and feed to the ALU's. What's the overhead of these? Seems far from trivial?
As for other potential models, I have to admit I'm not entirely convinced by their dismissal of the SIMT-style model. Sure, it needs a bit more micro-architectural state - a program counter per vector lane, basically. But there's also a certain simplicity in the model: no need for separate vector instructions; for the really basic stuff you need only the fork/join-type instructions to switch from scalar execution to SIMT and back. And there's no denying that SIMT has been an extremely successful programming model in the GPU world.
> It's also not fair to compare instructions executed between SIMD and some huge vector register implementation. Most common RISC-V CPUs are likely to have smaller vector registers, from 256 to 512 bits wide.
True; the more interesting part is the overhead stuff. Does your ISA require vector loads/stores to be aligned on a vector-size boundary? Well, then when vectorizing a loop you need a scalar pre-loop to handle the first elements until you hit the right alignment and can use the vectorized stuff. Similarly, how do you handle the tail of the loop if the number of elements is not a multiple of the vector length? If you don't have a vector length register or such, you need a separate tail loop. Or what if the data in memory isn't contiguous? Without scatter-gather and strided load/store you have to choose between not vectorizing or packing the data.
That bloats the code and is one of the reasons why autovectorizing for SIMD ISAs is difficult for compilers: often the compiler doesn't know how many iterations a loop will execute, and due to the above a large number of iterations is necessary to amortize the overhead. With a "proper" vector ISA the overhead is very small and it's profitable to vectorize every loop the compiler is able to.
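As an illustration of that bloat, here's a sketch of the usual head/body/tail dance for a fixed-width SIMD ISA (AVX assumed; scaling an array in place):

    #include <stddef.h>
    #include <stdint.h>
    #include <immintrin.h>

    void scale_f32(float *x, float k, size_t n) {
        size_t i = 0;
        /* head: peel scalar elements until x+i is 32-byte aligned */
        while (i < n && (((uintptr_t)(x + i)) & 31) != 0) {
            x[i] = k * x[i];
            ++i;
        }
        /* body: 8 floats per iteration */
        __m256 vk = _mm256_set1_ps(k);
        for (; i + 8 <= n; i += 8)
            _mm256_store_ps(x + i, _mm256_mul_ps(vk, _mm256_load_ps(x + i)));
        /* tail: whatever is left over */
        for (; i < n; ++i)
            x[i] = k * x[i];
    }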
Comparing dynamic instruction counts between a SIMD architecture with a 32-byte vector width and a vector architecture with 8*64 = 512-byte vectors is laughably misleading. Of course you can use fewer instructions if you're willing to throw hugely more transistors at the problem and carry around so much more architectural state.
There are reasons to prefer SIMD or vectors machines or, put another way, packed or unpacked vectors. But this is a very one-sided presentation. Also, some SIMD ISAs like Arm's SVE can handle different widths pretty nicely.
Keep in mind this was after configuring it to have exactly two vectors active. There might only be 1KB of state in there. That's halfway between AVX and AVX-512, so it doesn't strike me as particularly biased.
Prediction: in order to be performance-competitive with other ISAs for software written for SIMD, RISC-V will get a SIMD extension. However, because it wasn't there from the start, Linux distros will not compile their packages with SIMD enabled, and the result will be sad, like NEON on 32-bit ARM.
> While a simple vector processor might execute one vector element at a time, element operations are independent by definition, and so a processor could theoretically compute all of them simultaneously. The widest data for RISC-V is 64 bits, and today’s vector processors typically execute two, four, or eight 64-bit elements per clock cycle.
So does this argument boil down to an inversion of control which in turn removes unnecessary instructions? It certainly sounds more elegant to my naive ISA understanding.
Can I ask someone with hands-on SIMD experience: does relinquishing control over exactly what and how many "vector" operations occur in a single clock make any real-world difference?
Why doesn't this comparison include ARMv8 and ARMv8 NEON? ARMv8 NEON does support double precision and that can help DAXPY. I believe this has been the case since 2011 when AArch64 was announced (well, at least 2015).
I have a computer engineering degree from UIUC and my very first thought upon seeing MMX/SSE/Altivec/etc was "why didn't they make this arbitrarily scalable?" I was excited to be able to perform multiple computations at once, but the implementation seemed really bizarre to me (hardcoding various arithmetic operations in 128 bits or whatnot).
If it had been up to me, I would have probably added an instruction that executed a certain number of other instructions as a single block and let the runtime or CPU/cluster divvy them up to its cores/registers in microcode internally.
It turns out that something like this is conceptually what happens in vector languages like MATLAB (and Octave, Scilab, etc), which I first used around 2005. Its implementation is not terribly optimized, but in practice it doesn't need to be, because all personal computers since the mid 1990s are limited by memory bandwidth, not processing power.
For what it's worth, we're seeing similar ideas in things like graphics shaders, where the user writes a loop that appears to be serial and synchronous, but is parallelized by the runtime. I'm saddened that they had to evolve via graphics cards with their unfortunate memory segmentation (inspired by DOS?) but IMHO the future of programming will look like general-purpose shaders that abstract away caching so that eventually CPU memory, GPU memory, mass storage, networks, even the whole internet looks like a software-transactional memory or content-addressable memory.
We'll also ditch frictional abstractions like asynchronous promises in favor of something like the Actor model from Erlang/Go or a data graph that is lazily evaluated as each dependency is satisfied so it can be treated as a single synchronous serial computation. I've never found a satisfactory name for that last abstraction, so if someone knows what it's called, please let us know thanks!
P.S. the point of all this is to provide an efficient transform between functional programming and imperative programming so we can begin dealing in abstractions and stop prematurely optimizing our programs (which limits them to running on specific operating systems or hardware configurations).
> IA-32 instruction set has grown from 80 to around 1400 instructions since 1978, largely fueled by SIMD.
Holy quack! I didn't even know there were 80 (that feels like too much already; I barely used a tiny portion when exercising in assembly) - 1400 sounds really insane.
This is more than a little misleading. There are 8 different opcodes for each of INC, DEC, ADD and SUB, 6 each for AND, OR, XOR. MOV alone is 28 different opcodes. All of these groups of opcodes represent the same basic operation, but each opcode varies in addressing modes and types of arguments (e.g. there's a separate INC/DEC opcode for each register).
Much in the same way, AVX adds only 8 completely new instructions, but adds new 256-bit variants for many pre-existing SSE instructions. This generates an enormous number of opcodes, without actually increasing complexity _that_ much.
Long answer: Intel and AMD loooove length prefixes that basically move the base instruction into a new space. Because x86/x86-64 is a variable-length ISA they can get away with this. When people talk about ISA opcode bloat they deliberately include all the prefixes as 'separate opcodes', whereas most users would see them as the same opcode, but with modifiers that are not actually 'instructions' per se, as the core opcode didn't change. Historically, however, there are some quirky opcodes because of how things were implemented, as mentioned elsewhere. So some instructions may have one mnemonic but multiple core opcodes due to things like r/m vs reg/reg. This was because traditionally x86 didn't support three-operand opcodes, and still doesn't for many general non-vector instructions.
The previous poster is slightly understating the changes made in the newer SIMD ISAs. AVX and AVX-512 don't just add longer vector variants to previous instructions, they also add different modes to vector instructions.
AVX introduced three-operand instructions (essentially rA = rB op rC) instead of the normal two-operand instructions (rA = rA op rC) that is typical for x86. AVX-512 introduced vector masks as well (and per-instruction rounding modes).
Some quick overview of the x86 ISA: each instruction is essentially an opcode (of 1-3 bytes) followed by a "modR/M" byte. These bytes encode a 3-bit register number and a second input operand which is either an immediate, a single register, or a memory operand that has 2 registers, an immediate, and a scale (multiply by 1/2/4/8) parameter. Legacy prefixes provide a segment override, address size override, and operand size override capability--essentially controlling if the register should be referred to as dx or edx. Knowing if the register is eax or xmm0 is dependent on what the opcode is; they're both encoded as 0.
Extending the ISA to 64-bit added a REX prefix, which provides a bit to indicate if its 64-bit or 32-bit, as well as three extra register selector bits that get prepended to the modR/M results. The VEX prefix, introduced for AVX, includes all of the bits in the REX prefix, as well as another 4-bit register selector (the three-operand form as mentioned above), a 1-bit for 128-bit or 256-bit vectors, and another 2 bits of opcode table extension. The EVEX prefix (for AVX-512) includes all of the bits from the VEX prefix, as well as another vector length bit, 3 bits for accessing zmm16-31 registers, 3-bits for a vector mask register, another bit for masking mode, and another bit for rounding mode.
It sounds complicated, but most of these bits are actually just providing either an extension to the size of the register set, the size of the operation being done, or the introduction of a few new operands into the operation. By the time you leave the decoder, there's not really any new data being passed onto later hardware stages.
A bigger problem is the relative length of these new instructions. I've noticed some more recent AVX instructions spreading over 8 or 9 bytes in disassembly.
The problem with this is that when the processor stalls on an instruction fetch for the next cacheline of code, it just sits there idle for the entire time. This greatly elongates your tails when looking at performance in terms of latency percentiles.
It makes me wonder if Intel or AMD have investigated a MicroVAX-like trimming or compression of the opcodes so that the most common/useful codes fit in the fewest bytes. In particular, it seems like SIMD encoding lengths are inverted: the longest vectors should have the shorter opcodes since they're more useful. It might even be worth deprecating MMX/128-bit SSE.
AMD64 came out in 2003, a new decoder might be appropriate by 2023.
We're past the time that a human needs to understand assembly instructions.
In the future, instructions will be designed by machine, for example by considering millions of permutations of possible instruction "combined add with shift with multiply by 8 and set the 6th bit", "double indirect program counter jump with offset 63", etc.
Each permutation will be added to various compilers and simulated by running benchmarks on complete architecture simulators to find out which new instruction adds the most to the power to die area to performance to code size tradeoff.
I predict there will be many more future instructions with 'fuzzy' effects which don't affect correctness, only performance. Eg. 'Set the branch predictor state to favor the next branch for the number of times in register EAX', or 'go start executing the following code speculatively, because you'll jump to it in a few hundred clock cycles and it would be handy to have it all pre-executed'.
Security may throw a wrench into that. Preventing Spectre et al in such an environment would be a challenge. Not mathematically insurmountable, but possibly unsurmountable with real humans and real economics.
Isn't the whole point of SIMD to be as similar to the original x86 instructions as possible, reusing as much of the existing CPU as possible? Otherwise you would have something like the PS3?
Yes and no. SIMD (Single Instruction Multiple Data) as a concept has nothing to do with x86, it's basically just the concept of vectorizing the code and is used on many platforms.
The x86 SIMD extensions such as SSE and AVX, on the other hand, aim to integrate that concept with x86 and are therefore pretty similar.
If you care about latency, a modern 8-or-more-core x86 with its L1/L2 cache segmentation and penalized-but-shared L3 cache is almost as complex. It becomes even more complex if you use the CPU topology to make inferences about hyperthreading and shared caches, or need to deal with the shared FPU on older AMD processors.
My understanding is that the largest difference is that some of the Cell cores had different opcodes that meant you could schedule some threads on some cores but not any thread on any core.
I have written quite a bit of SPE code. The primary issue is that the SPE processor could only read/write its 256kB of local memory (without doing a DMA). So literally, object-oriented code doesn't even work (because of vtables). The C/C++ model is not designed for this type of architecture. Yes, there were also limitations like vector-only registers and memory alignment, but the biggest issue was the local memory.
Too far away and requires huge thread counts to make up for its overheads. A good vector unit should work well even for very short vectors (e.g., any size of memcpy).
We had numerous different examples of this in the Hyperscan project and I broke out something similar on my blog: https://branchfree.org/2018/05/30/smh-the-swiss-army-chainsa...