The constant widening of SIMD registers is an interesting trend to watch. MMX started off at 64 bits, and AVX-512 is now 8 times that. With many cores and a couple of SIMD units in each core, aren't CPUs becoming ever more suitable for handling graphics workloads that were once feasible only on GPUs?
GPUs are still about 5x ahead of CPUs. CPUs lack texture lookup hardware; that's the biggest catch right now. AVX2 adds gather support (one instruction, up to 8 independent loads into a single vector register), but currently on Haswell there's no performance benefit compared to doing the memory loads individually.
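For illustration, a rough sketch of what that gather looks like in C intrinsics (my own example, assuming a compiler with immintrin.h and AVX2 enabled); the eight independent loads are expressed as one instruction, even though Haswell still performs them more or less serially under the hood:

    #include <immintrin.h>

    // Gather 8 ints from arbitrary offsets into one 256-bit register.
    // base: array to load from; idx: 8 independent 32-bit indices.
    __m256i gather8(const int *base, __m256i idx)
    {
        // scale = 4 bytes per element; compiles to a single vpgatherdd
        return _mm256_i32gather_epi32(base, idx, 4);
    }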
GPU memory (GDDR5) also has many times more bandwidth, at the expense of 10x higher latency compared to CPU memory.
Currently you put the workload that requires low latency on a CPU and the embarrassingly parallel, "non-branchy" workload on a GPU.
Both CPUs and GPUs gain about 20% performance per year per core. I'm no expert on CPU and GPU memory architectures, but both seem to be headed towards integrated 3D-stacked eDRAM memory. This can provide up to several TB/s of reasonably low-latency bandwidth.
Everything vardump said is true, but additionally there's the consideration of what isn't there as well as what is there. By not having an out-of-order execution setup or a broad forwarding network, a GPU core is able to be much smaller than a CPU core of equivalent throughput. By "core" here I mean something capable of independently issuing instructions and executing them, not calling each execution unit a "core" the way Nvidia does (though with SIMT that isn't as crazy as it might look at first). That means you can pack more cores on a chip and use less power for a given amount of work, since power consumption is roughly proportional to the number of transistors used. If we eventually start doing graphics on CPUs it will be because we move to ray tracing or other sorts of algorithms that GPUs are bad at.
One more thing. GPUs don't have to worry about precise interrupts, such as when you try to load a piece of memory only to find that it has to be paged in. If they're in a memory-protected setup where that's even possible, it's the CPU that has to deal with the mess; the GPU can just halt while that's taken care of.
This makes SIMT easy. SIMT is "Single Instruction, Multiple Threads", which is sort of like SIMD but better. You have an instruction come in which is distributed to multiple execution lanes, each of which has a hopper of instructions it can draw from. Each lane then executes those instructions independently, and if the lane notices that the instruction has been predicated out it just ignores it. Or maybe if one lane is behind it can give the instruction to its neighbor which isn't. The fact that you don't have to be able to drop everything in a single cycle and pretend you were executing perfectly in order gives the hardware a lot of flexibility, and all of this complexity is only O(n) in the number of execution lanes instead of O(n^2) as with a typical OoO setup. The need to have precise exceptions involving SIMD instructions means that this isn't a simple thing for a regular CPU core to add to its SIMD units.
The fact that each lane makes decisions about when to execute the instructions it has been issued is why some people refer to the lanes as "cores". I don't, because what they're doing isn't any more complicated than what the 8 reservation stations in a typical Tomasulo-algorithm OoO CPU core would be doing, even if they are smarter than a SIMD lane. With GPUs it makes more sense to break down "cores" by instruction issue, in which case a high-end GPU would have dozens rather than thousands of cores.
I think GPUs do have faulting mechanisms, if that's what you meant by "precise interrupts". How else would they bus-master page memory over PCIe?
> Each lane then executes those instructions independently and if the lane notices that the instruction has been predicated out it just ignores it.
This is exactly what you do on CPUs as well. In SSE/AVX you often mask unwanted results instead of branching. Just like on GPUs. AVX has 8 lanes, 16 with Skylake.
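As a rough illustration (my own sketch, not from the thread), masking instead of branching with AVX intrinsics looks like this: compute both results for all 8 lanes, then blend them with a comparison mask, so no per-element branch is needed:

    #include <immintrin.h>

    // out[i] = (a[i] > 0) ? a[i] * 2 : a[i] * 3, for 8 floats at a time.
    // Both sides are computed; the mask selects the result per lane.
    __m256 predicated(__m256 a)
    {
        __m256 mask     = _mm256_cmp_ps(a, _mm256_setzero_ps(), _CMP_GT_OQ);
        __m256 if_true  = _mm256_mul_ps(a, _mm256_set1_ps(2.0f));
        __m256 if_false = _mm256_mul_ps(a, _mm256_set1_ps(3.0f));
        return _mm256_blendv_ps(if_false, if_true, mask);
    }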
Regarding faulting mechanisms, if you've got a discrete GPU on a PCI bus then it's a separate piece of silicon that handles the network protocol. The important point is that I don't believe that the GPU cores have to be able to halt execution and save their state at any instruction.
It's certainly true that SIMD instructions in CPUs have predication which saves you a lot of trouble. The difference is that if you have two instructions which are predicated in a disjoint way you can execute them both in the same cycle in a SIMT machine but you would have to spend one cycle for each instruction in a SIMD machine. You can look at Dylan16807's link for all the details.
GPUs don't support precise exceptions. For example, you can't take a GPU program that contains a segfault, run it as a standard program (as in, not in a debug mode), and be presented with the exact instruction that generated the fault.
>If we eventually start doing graphics on CPUs it will because we move to ray tracing or other sorts of algorithms that GPUs are bad at.
Even for ray tracing, I think you'll get better results on a modern GPU. Each individual processing element may not be anywhere near as good at dealing with the constant branch prediction failures inherent in ray tracing algorithms as modern CPU cores are, but there are thousands of them for every CPU core you have, and millions of completely independent samples to compute in a raytraced image. But I suspect that by the time realtime ray tracing becomes viable, the HSA approach I outlined above will be viable as well, and that will be the way to go.
I usually call Nvidia's SMX a core. Nvidia's warp/hardware-thread architecture is certainly interesting, but one has to remember it's mostly there to get around memory bottlenecks. All those contexts also need a lot of gates. GDDR5 has high throughput but an order of magnitude higher latency.
Ray-tracing doesn't really buy much unless you're doing some specific subset of effects. Rasterization covers 99% of cases with higher efficiency.
I had a pretty big post going, but vardump and Symmetry conveyed the gist of what I was going for much more succinctly, so I'll summarize:[1]
GPU cores are tiny because the problems they deal with are "embarrassingly parallel", trivially solved by throwing more cores at the problem. You make the cores as simple as possible so you can have thousands of them on a chip. Modern GPUs don't even have SIMD units per core any more; both NVidia and AMD are completely scalar now. You'd think that graphics would be the perfect scenario for SIMD, since shaders spend so much time dealing with 3 and 4D vectors, transformation matrices, colors, and so on, but it worked out that the gain in throughput and instruction density per-core was outweighed by the power, heat, and die cost of having parts of those thousands of SIMD units sitting idle while working on data that doesn't take up a whole SIMD register. And because context switches are much rarer on GPUs than CPUs, they can have extremely deep pipelines that push compute efficiency even further, at the expense of context switch latency.
CPU cores are, if I may, "fuckhuge", because by and large they can't solve their problems by throwing more cores at them. They take on problems that are inherently serial and branch heavy, like compiling a program or optimally compressing a large file, and throw bigger cores at them (out-of-order execution, branch prediction, speculative execution, multiple ALUs per core, instruction schedulers that exploit the parallelism hidden in the serial instruction stream, large register files only visible to the microarchitecture, etc) while maintaining a fairly short pipeline so that branch prediction failures and context switches don't take too long to recover from. SIMD fits in well here, because the cost of bigger and bigger ALUs is pretty much insignificant compared to all the other hardware that goes into a high-end CPU core. It can be a pain to optimize for, but it's great for middle-ground tasks that need both heavy parallel and serial/branch-heavy computing resources with little latency between the two, like video compression.
So you can't make one as good at solving the tasks of the other without making it worse at the task it's already intended to do. They tried to do this with Larrabee, which was, no exaggeration, a few hundred Pentiums on a chip. Very interesting, but unsuccessful in the market, because it was an expensive niche product that was worse than either a CPU or GPU for the majority of their respective workloads. What's really interesting, however, is the direction that AMD is taking. Rather than pushing SIMD like Intel, they're trying with their "heterogeneous systems architecture" movement to break down the communication barrier between CPUs and GPUs so that sharing data between the two is as easy as passing a pointer. SIMD is still great on CPUs for dealing with higher-level primitives like vectors, matrices, or image macroblocks, but I can easily see the aforementioned middle ground shifting to CPU threads working in close concert with GPU worker threads, as opposed to the mostly hands-off, one-way street common in e.g. 3D rendering today.
[1] On second thought, this didn't turn out to be much of a summary, did it?
>>Modern GPUs don't even have SIMD units per core any more; both NVidia and AMD are completely scalar now.
This is factually incorrect. Let me quote Vasily Volkov [1]: "Earlier GPUs had a 2-level SIMD architecture — an SIMD array of processors, each operating on 4-component vectors. Modern GPUs have a 1-level SIMD architecture — an SIMD array of scalar processors. Despite this change, it is the overall SIMD architecture that is important to understand."
I'm aware of AMD's terminology here, and I disagree with it. They say "SIMD array of scalar processors" where "SIMT" makes much more logical sense to me. I believe the only reason that they maintain this "2-level vs. 1-level SIMD" distinction is because NVidia coined the term SIMT, so they refuse to use it due to marketing concerns/some possible trademark or other IP law liability/pride.
...and now I read your paper, and I need to eat humble pie, because it apparently predates NVidia's use of the term "SIMT." Apparently, even if it makes more logical sense, SIMT is the buzzword!
Either way, I think it's a stretch to say I'm "factually incorrect" just because I used different terminology. Whether you think of it as an "SIMD array of scalar processors", or (to use NVidia terminology) an "SIMT warp of scalar threads" the important thing here is that the execution units in the bundle with the shared instruction pointer operate on scalar values now, rather than 4-element vectors.
What GPU vendors call cores aren't. It's like calling the pins of a hair brush hair brushes.
I think something should be called a "core" only if it can branch its instruction stream. Predication doesn't count. A single core can run multiple instruction streams (GPUs, Intel Hyper-Threading).
You can compile different files in parallel. You can probably even compile a single file in parallel. Consider this: If you start reading in the middle of a source file, can you make sense of what you see, or is it incomprehensible garbage unless you start at the top?
The answer is that it mostly makes sense except you may be tricked by #defines for example. Thus you could probably "optimistically parse" small chunks of the source code, all in parallel, with the understanding that you may have to throw out some intermediate results in light of new understandings.
Yeah, of course there are tasks that can be executed with coarse-grained parallelism, that can take advantage of multiple threads. It's not embarrassingly parallel though, like with graphics, where you have millions of pixels that can be shaded independently of all the other pixels in an image, per rendering pass. In an optimizing compiler, compiled functions are not independent of each other, because data flow analysis will affect code generation, so you can't just compile each function on a different thread and combine the results. At least on a per-file basis, compilation has an inherently serial bottleneck, and that's not considering that 1. not all languages have separate compilation and linking like the C family, and 2. even with whole program optimization disabled, the linking phase can be fairly complicated in a modern linker, so you have yet another serial bottleneck. And if you're implementing a non-optimizing compiler, then even with a naive, completely serial compiler, the whole process won't take long enough for it to be worth going to this trouble.
Glitch works this way. It will simply redo the tree when better data is available (via dependency tracing); so you can parse, type check various parts within and between files using what is essentially optimistic parallelism.
This is quite useful for non-regular problems that are otherwise difficult to parallelize.
I should have been clear, I'm talking about Intel and AMD's monster desktop and server-class CPUs here, not mobile. For that matter, mobile GPUs are very different from desktop/console/HPC-class GPUs as well.
From what I could gather from building and testing a new desktop a couple of months ago, AVX-heavy workloads cause CPU voltage to rise above 'normal' levels and generate a ton of heat as a result, often throttling if you aren't using aftermarket cooling. I would guess that AVX has a long way to go in terms of power efficiency before you'll see any large use of it.
>If we set _foo to 0 and have two threads that both execute incl (_foo) 10000 times each, incrementing the same location with a single instruction 20000 times, is guaranteed not to exceed 20000, but it could (theoretically) be as low as 2.
Initially it seemed to me the theoretical minimum should be 10000 (as the practical minimum seems to be). But it does indeed seem possible to get 2:
1) Both threads load 0
2) Thread 1 increments 9999 times and stores
3) Thread 2 increments and stores 1
4) Both threads load 1
5) Thread 2 increments 9999 times and stores
6) Thread 1 increments and stores 2
Is there anything in the x86 coherency setup that would disallow this and make 10000 the actual minimum?
I think not. The only limit seems to be the number of times one CPU can increment while the other is between load and store operations.
If you could manipulate your two cores with enough dexterity, you could force a 2 result, but without enough fine-grained control, the practical limit is going to be determined by the number of increments that can be "wasted" by sandwiching them between the load and store operations of the other thread. You would essentially need to stop and start each CPU at four very specific operations.
The practical results show that "load + add + store + load + add + store" on one thread probably never happens during a single "add" on the other thread. You would need that to happen at least once to get below 10000. Otherwise, each increment can waste no more than one increment for the other thread, and you end up with at least 10000.
The experimental numbers are probably indicative of how long the add portion of INCL takes in relation to the whole thing.
Unless you use a "lock inc", increment is semantically a read-modify-write, so I believe those steps can technically happen on x86. However, since a thread cannot be interrupted in the middle of an instruction, it's very unlikely that 9999 operations can happen in one thread while another executes an increment. If you wrote it as a mov, an add, and another mov, all it would take is a context switch at exactly the right time.
On the other hand, 20000 is unlikely as well, if the threads have any concurrency at all.
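A minimal sketch of the plain-vs-locked difference in C (my own example, assuming GCC or Clang on x86-64): the plain increment is a non-atomic read-modify-write, while the atomic builtin emits a LOCK-prefixed instruction that makes the whole read-modify-write indivisible:

    // Two threads each call one of these 10000 times on the same counter.

    void inc_plain(volatile int *p)
    {
        (*p)++;   // non-atomic read-modify-write (incl/addl on memory, or a
                  // separate load/add/store); concurrent increments can be lost
    }

    void inc_atomic(int *p)
    {
        // emits a LOCK-prefixed read-modify-write; no increments are lost
        __atomic_fetch_add(p, 1, __ATOMIC_RELAXED);
    }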
The author is incorrect in the section about memory fences. x86 has strong memory ordering [1], which means that writes always appear in program order with respect to other cores. Use a memory fence to guarantee that reads and writes are memory bus visible.
The example that the author gives does not apply to x86.
[1] There are memory types that do not have strong memory ordering, and if you use non-temporal instructions for streaming SIMD, SFENCE/LFENCE/MFENCE are useful.
Modern x86 machines do not quite have strong memory ordering: loads may be reordered with older stores to different locations. This breaks some of the overly clever "lock-free" algorithms.
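As an illustration (my own sketch, not from the comment), the classic store-buffer litmus test shows the one reordering x86 does allow: each thread's load can complete before the other thread's store becomes visible, so both threads can read 0 unless a full fence (or a LOCK-prefixed instruction) separates the store from the load:

    // Shared flags, both initially 0.
    volatile int x = 0, y = 0;
    int r0, r1;

    void thread0(void) {
        x = 1;                   // store
        // __sync_synchronize(); // a full barrier (mfence) here rules out r0 == r1 == 0
        r0 = y;                  // load; may complete before the store above is visible
    }

    void thread1(void) {
        y = 1;
        r1 = x;
    }

    // On x86, r0 == 0 && r1 == 0 is a permitted outcome without the fences.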
x86 enforces (essentially) total store order across all sockets: http://www.cl.cam.ac.uk/~pes20/weakmemory/index3.html. The barriers are still useful for kernel code because other processors on the machine usually don't participate in the cache-coherency protocol.
For a somewhat dated, introductory, high-level survey of the state of CPU design up to 2001, I find [1] to be quite informative. Of course, having been written in the frequency-scaling era, some of its 'predictions' are way off the mark. Nevertheless, I find it a good resource for someone looking to get a bird's-eye view of the 30+ years of development in the field.
If you enjoyed this article, you might also want to check out Agner Fog's optimization manuals and blog (which I decided were probably worth a separate submission): https://news.ycombinator.com/item?id=8874206
Even if one programs in assembly, such features are a "black box", because instruction sets rarely provide any instructions for controlling these things, apart from cache prefetching hints, which the CPU is not required to honour.
I do not think one has full control over OOO execution on x86-64 processors. Also I do not believe one has control over the execution pipeline even in assembly, although I do not know exactly what you mean by that, so it could just be a misunderstanding.
And headers for the memory subsystem (DDR SDRAM)! I mean, I know there's usually cache-line-sized interleaving repeating for each memory channel, but it would sure be nice to reliably control which memory access goes to which memory module.
With NUMA, physical memory ordering gets even uglier: usually each physical socket's memory is one big chunk, but sometimes it's also interleaved every 4 kB.
ANSI C doesn't expose any intrinsics, but once you get used to GCC extensions and attributes, it starts to feel a lot more "low-level". Think of __builtin_expect, __builtin_prefetch, alignment attributes, etc.
It's not quite like writing assembly because these intrinsics are meant to translate to the "best" instructions for the particular ISA that the code is targeted for. You are able to compile the same chunk of code for two different architectures, for example. If you were to write inline assembly, you would be forced to have a separate chunk of code for each architecture and surround them with #ifdefs.
The code can be made portable with a few simple macros (which is why they are implemented the way they are in GCC), though the optimizations will not be.
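A small sketch of what that looks like in practice (my own example, assuming GCC or Clang): the hints hide behind trivial macros, so the code stays portable and simply loses the hints on compilers that lack the builtins:

    #ifdef __GNUC__
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define prefetch(p) __builtin_prefetch(p, 0, 3)   // read, high temporal locality
    #else
    #define likely(x)   (x)
    #define prefetch(p) ((void)(p))
    #endif

    struct node { struct node *next; long value; };

    // Sum a linked list, prefetching the next node while working on this one.
    long sum_list(const struct node *n)
    {
        long total = 0;
        while (likely(n != 0)) {
            prefetch(n->next);
            total += n->value;
            n = n->next;
        }
        return total;
    }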
> Even most common extensions just cover part of them.
The headers I linked to here (https://news.ycombinator.com/item?id=8873764) cover most useful Intel CPU extensions. And you can use them easily in code; it just sucks when you have to do CPU feature detection and have different code paths for different CPUs.
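For illustration (my own sketch, assuming a reasonably recent GCC or Clang), the feature-detection-and-dispatch part can be as small as this; transform_avx2 and transform_scalar are hypothetical stand-ins for the two code paths:

    void transform_avx2(float *data, int n);     // hypothetical AVX2 path
    void transform_scalar(float *data, int n);   // hypothetical fallback path

    void transform(float *data, int n)
    {
        // __builtin_cpu_supports() checks a CPUID-based feature table
        // that is initialized once at program startup
        if (__builtin_cpu_supports("avx2"))
            transform_avx2(data, n);
        else
            transform_scalar(data, n);
    }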
> it just sucks when you have to do CPU feature detection and have different code paths for different CPUs.
In the Linux kernel they went so far as to actually replace instructions at boot time, at known locations, based on the capabilities of the CPU the code is executing on. This saves them from having to maintain many paths in the compiled code.
Games on consoles, where the hardware architecture is frozen and you can optimize to the metal all you want.
Not all titles do this, since most titles are cross-platform and you're not going to do tons of optimization for a specific one unless there is a huge payoff.
I've used a lot of SSE2 and prefetch intrinsics in Visual C++ for math/data-heavy high-performance computing applications. It can make a fairly big difference -- like more than 50% performance improvement for key parts of the code.
That depends heavily on which CPU and memory subsystem combination you're running it on. You need to know how far to prefetch and how to order your data in memory.
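As a rough sketch of the pattern (my own example, assuming MSVC or GCC with the SSE2 headers), you prefetch a fixed distance ahead of the data you're currently processing; the right distance is a tunable guess that depends on exactly that CPU/memory combination:

    #include <emmintrin.h>   // SSE2
    #include <xmmintrin.h>   // _mm_prefetch

    #define PREFETCH_DIST 512   // bytes ahead; tune per machine

    // Sum n doubles; assumes n is a multiple of 2 and a is 16-byte aligned.
    double sum(const double *a, int n)
    {
        __m128d acc = _mm_setzero_pd();
        for (int i = 0; i < n; i += 2) {
            _mm_prefetch((const char *)&a[i] + PREFETCH_DIST, _MM_HINT_T0);
            acc = _mm_add_pd(acc, _mm_load_pd(&a[i]));
        }
        double out[2];
        _mm_storeu_pd(out, acc);
        return out[0] + out[1];
    }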
Many of the capabilities described in the article are just hardware features which don't need any control from application developers, but their effectiveness can be improved if developers are aware of them (e.g. caches, TLBs). Other properties are exploited automatically by compilers (e.g. instruction scheduling to improve throughput on a superscalar processor).
Instruction set extensions (which I believe might be what you were thinking of) are used via intrinsics and assembly programming by pretty much any performance-oriented code. For example look at the dozens of architecture-specific implementations of glibc functions: x86-64[0], arm[1].
Speaking of which, I think we'll see NVMe being the new hotness this year. NVMe is going to bring crazy parallelism and queue depth to block devices. Finally we'll be able to exploit flash's inherent parallel access in a sane way.
> If we set _foo to 0 and have two threads that both execute incl (_foo) 10000 times each, incrementing the same location with a single instruction 20000 times, is guaranteed not to exceed 20000, but it could (theoretically) be as low as 2.
Can someone explain the scenario where this test would result in _foo = 2? The lowest theoretical value I can understand is _foo = 10000 (all 10000 incls from thread 1 are executed between one pair of load and store in thread 2 and hence lost).
You're almost there! Basically that swapping you outlined occurs twice, once on each side.
1. Both threads read '0', beginning their first incl instruction (A: 0, B: 0, Store: 0)
2. Thread A completes execution of all BUT ONE incl, writing its second-to-last value to the store (A: 9999, B: 0, Store: 9999)
3. Thread B increments its '0' to '1' (A: 9999, B: 1, Store: 9999)
4. Thread B writes its '1' to the store (A: 9999, B: 1, Store: 1)
5. Thread A begins its final incl and reads the '1' that thread B just stored (A: 1, B: 1, Store: 1)
6. Thread B executes all remaining incl instructions, writing its final value to the store (A: 1, B: 10000, Store: 10000)
7. Thread A continues its final incl and increments its '1' to '2' (A: 2, B: 10000, Store: 10000)
8. Thread A stores its final '2' result (A: 2, B: 10000, Store: 2)
9. All instructions have executed, and the result is '2'
It is, of course, purely theoretical and extremely unlikely in real use, as the author's graph shows, but it's one of those threading 'gotchas' that lead to occasionally unpredictable results.
These features debuted way before the 80s, arguably with the exception of out-of-order execution. Take any feature discussed and its Wikipedia history section will tell you about CPUs and computers in the 60s that first had it. Caches, SIMD (called vector processing back then), speculative execution, branch prediction, virtual machines, virtual memory, accelerated I/O bypassing the CPU... all from the 60s.
One of the other weird things is that the general purpose chips are supposed to be getting large chunks of memory behind the northbridge. If I'm not mistaken, it's supposed to be used as another level in the cache hierarchy, which could be another boon to things that leverage concurrency.
Nice improvements. But they seem only marginal. Yes, speed may have gone up by a certain factor (which has remained surprisingly stable in the last decade, or so it seems).
On the other hand, programming complexity has gone way up, while predictability of performance has been lost.