What's new in CPUs since the 80s? (2015) (danluu.com)
252 points by snvzz on April 22, 2022 | 106 comments



Interesting point about L1 cache sizes and the relationship between page size

> Also, first-level caches are usually limited by the page size times the associativity of the cache. If the cache is smaller than that, the bits used to index into the cache are the same regardless of whether you're looking at the virtual address or the physical address, so you don't have to do a virtual to physical translation before indexing into the cache. If the cache is larger than that, you have to first do a TLB lookup to index into the cache (which will cost at least one extra cycle), or build a virtually indexed cache (which is possible, but adds complexity and coupling to software). You can see this limit in modern chips. Haswell has an 8-way associative cache and 4kB pages. Its l1 data cache is 8 * 4kB = 32kB.

Having helped build the virtually indexed cache of the arm A55 I can confirm it's a complete nightmare and I can see why Intel and AMD have kept to the L1 data cache limit required to avoid it.

Interestingly Apple may have gone down the virtually indexed route (or possibly some other cunning design corner) for the M1 with their 128 kB data cache. However, I believe they standardized on 16k pages, which would still allow physical indexing with an 8-way associative cache. So what do they do when they're running x86 code with 4k pages? Do they drop 75% of their L1 cache to maintain physical indexing? Do they aggressively try to merge the x86 4k pages into 16k pages, with some slow back-up when they can't? Maybe they've gone with some special-purpose hardware support for emulating x86 4k pages on their 16k page architecture. Or have they indeed just implemented a virtually indexed cache?


> with their 128 kB data cache

Bigger than that. (https://github.com/apple/darwin-xnu/blob/2ff845c2e033bd0ff64...)

/* I-Cache, 256KB for Firestorm, 128KB for Icestorm, 6-way. */

/* D-Cache, 160KB for Firestorm, 8-way. 64KB for Icestorm, 6-way. */

Apple also uses a base 16KB page size on macOS, with 4KB being present for backwards compatibility (x86_64 apps) and the wider ecosystem (running Windows for example).

> Maybe they've gone with some special purpose hardware support for emulating x86 4k pages on their 16k page architecture

Nah, it's that when you run an x86 process, one of the TTBRs is set up to use a 4KB translation granule (i.e. 4KB pages).


Firestorm's L1 d$ is 128kB; most diagrams and documentation show that. This latency measurement pretty much confirms it too -

https://www.anandtech.com/show/16226/apple-silicon-m1-a14-de...

Apple definitely takes advantage of VI=PI for their caches.

160kB is not a good size for a cache, surely must be a typo.

i$ has more options for handling aliases. You don't actually even need to eliminate them at all because it's a read-only cache. So it's not unusual to see these caches exceed the VI=PI geometry.


That’s what Apple explicitly says. Why they don’t match with measurements is another matter… (and no, it’s not a typo)

ECC? Needs hugepages? Or hardware erratum? I don’t know for sure.

Also, at least from the VM perspective icache is reported as PIPT, which is not too surprising but still worth noting.


> That’s what Apple explicitly says. Why they don’t match with measurements is another matter… (and no, it’s not a typo)

Ridiculous. You're just making things up. It clearly doesn't have a 160kB 6-way L1 dcache. Nor are the small cores 64kB 6-way -- those numbers don't even divide evenly; that's ridiculous. Why do you say it's not a typo?

They "explicitly" say it is 160kB in that backwater code comment. They also explicitly say it's 128kB in this presentation https://www.youtube.com/watch?v=5AwdkGKmZ0I&t=534s so what makes your source the authoritative one?? Pretty obviously it's a quick cut and paste job with a typo (two in one line actually) in it.

> ECC? Needs hugepages? Or hardware erratum? I don’t know for sure.

ECC? Erratum? Come on. It's clearly because it's 128kB.


I actually asked and got told that it's not a typo. But I got no more details beyond that, so I can only speculate…

And yeah, 192KB/128KB is what the HW behaves as, by all accounts and measurements, making this even more odd…


Okay so it was deliberately written by someone who was misinformed about at least two properties of the caches then, both size and associativity.


> Do they aggressively try and merge the x86 4k pages into 16k pages with some slow back-up when they can't do that?

This does not seem feasible because the 16k pages on ARM are not "huge" pages; it's a completely different arrangement of the virtual address space and page tables. The two are not interoperable.


Perhaps that explains the odd cache sizes. Map x86 onto its own cache when necessary.


This took me a while to click, but:

- cache entries are assumed to be looked up by the virtual address (not physical), to save a TLB lookup

- within a single page, the line index is the same in both the physical and the virtual address

- if the cache has enough sets, it doesn't just use those line bits, but also some bits of the page number

Here's the crucial bit: a single physical page may be mapped at several different virtual addresses, so if the cache indexed with virtual page-number bits, it could end up with multiple copies of the same line. Kaboom.

That's why having more sets in a cache than lines in a page is troublesome.
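
To make that concrete, here's a tiny sketch of my own (not from the article), using the Haswell-like numbers from the quote above: 32 kB, 8-way, 64 B lines, 4 kB pages. The set index falls entirely inside the page offset, so virtual and physical addresses agree on it.

    #include <cstdint>
    #include <cstdio>

    int main() {
        // Each way is 32 KiB / 8 = 4 KiB = exactly one page, so the set index
        // comes from page-offset bits only; the cache can be indexed in
        // parallel with the TLB lookup.
        constexpr std::uint64_t line = 64, ways = 8, size = 32 * 1024, page = 4096;
        constexpr std::uint64_t sets = size / (ways * line);   // 64 sets
        std::uint64_t vaddr = 0x7fff12345678;                  // arbitrary virtual address
        std::uint64_t set   = (vaddr / line) % sets;           // uses address bits 6..11 only
        std::printf("sets=%llu set=%llu (index bits 6..11, all below log2(page)=12)\n",
                    (unsigned long long)sets, (unsigned long long)set);
        static_assert(size / ways <= page, "index bits stay within the page offset");
    }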


Please add an RSS or Atom feed to your blog :)


2015. A good exercise would be "What's new in CPUs since 2015?" A few I can think of: branch target alignment has returned as a key to achieving peak optimization, after a brief period of irrelevance on x86; and x86's user-space monitor/wait/pause instructions have, for the first time, exposed explicit power controls to user programs.

One thing I would have added to "since the 80s" is the x86 timestamp counter. It really changed the way we get timing information.


L3 caches have grown monstrously.

The new AMD Ryzen 5800X3D has 96MB of L3 cache. This is so monstrous that its 2048-entry TLB with 4kB pages can only cover 8MB (2048 × 4kB).

That's right, you run out of TLB-entries before you run out of L3 cache these days. (Or you start using hugepages damn it)

----------

I think Intel's PEXT and PDEP were introduced around the 2015 era. But AMD chips now execute PEXT / PDEP quickly too, so it's now feasible to use them on most people's modern systems (assuming Zen 3 or a 2015+ era Intel CPU). Obviously those instructions don't exist in the ARM / POWER9 world, but they're really fun to experiment with.

PEXT / PDEP are effectively bitwise-gather and bitwise-scatter instructions, and can be used to perform extremely fast and arbitrary bit-permutations. I played around with them to implement some relational-database operations (join, select, etc. etc.) over bit-relations for the 4-coloring theorem. (Just a toy to amuse myself with. A 16-bit bitset of "0001_1111_0000_0000" means "(Var1 == Color4 and Var2==Color1) or (Var2==Color2)".)

There's probably some kind of tight relational algebra / automatic logic proving / binary decision diagram / stuffs that you can do with PEXT/PDEP. It really seems like an unexplored field.
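
For anyone who hasn't played with them, here's a minimal sketch of the raw gather/scatter behaviour (my own toy, nothing to do with the database experiment; needs a BMI2-capable CPU and -mbmi2):

    #include <cstdint>
    #include <cstdio>
    #include <immintrin.h>   // _pext_u64 / _pdep_u64 (BMI2)

    int main() {
        // PEXT gathers the bits selected by a mask into the low end of the
        // result; PDEP scatters low-order bits back out under a mask.
        std::uint64_t value = 0b10110110;
        std::uint64_t mask  = 0b11110000;                          // select the high nibble
        std::uint64_t gathered  = _pext_u64(value, mask);          // -> 0b1011
        std::uint64_t scattered = _pdep_u64(gathered, 0b00001111); // -> 0b00001011
        std::printf("gathered=%llx scattered=%llx\n",
                    (unsigned long long)gathered, (unsigned long long)scattered);
    }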

----

EDIT: Oh, another big one. ARMv8 and POWER9 standardized upon the C++11 acquire-release memory model. This was inevitable: Java and C++ standardized on that memory model in the 00s / early 10s, so chips would inevitably be tailored to it.
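
The model in one sketch (my own toy example): a release store pairs with an acquire load, so a reader that sees the flag is guaranteed to also see the data written before it.

    #include <atomic>
    #include <cstdio>
    #include <thread>

    int data = 0;
    std::atomic<bool> ready{false};

    void producer() {
        data = 42;                                     // plain store
        ready.store(true, std::memory_order_release);  // publish the data
    }

    void consumer() {
        while (!ready.load(std::memory_order_acquire)) {}  // spin until published
        std::printf("%d\n", data);                          // guaranteed to print 42
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join(); t2.join();
    }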


> That's right, you run out of TLB-entries before you run out of L3 cache these days.

This is more reasonable than it sounds. A TLB miss can in many cases be faster than an L3 cache hit.


It's also misleading because the chip has 8 cores and each of them has 2048 L2 TLB entries. Altogether they can cover 64MiB of memory with small pages.


But the 5800X3D has 96MB of L3. So even if all 8 cores are independently working on different memory addresses, you still can't cover all 96MB of L3 with the TLBs.

EDIT: Well, unless you use 2MB hugepages of course.


That's another thing which is recent. Before Haswell, x86 cores had almost no huge-page TLB entries. Ivy Bridge only had 32 in 2MiB mode, compared to 64 + 512 in 4KiB mode.


Are you sure? TLB misses mean a page walk. Sure, the page directory tree is probably in the L3 cache, but repeatedly page-walking through L3 to find a memory address is going to be slower than just fetching it from the in-core TLB.

I know that modern cores have dedicated page walking units these days, but I admit that I've never tested the speed of them.


It only takes ~200KB to store page tables for 96MB of address space (96MB / 4kB = 24,576 leaf PTEs, at 8 bytes each that's roughly 192KB, plus a little for the upper levels). So the page table entries might mostly stay in the L1 and L2 caches


I think you made an error in your assumptions.

Each 64-byte cache line could feasibly come from a different page in the worst case.

I think modern processors actually pull 128 bytes from RAM at the L3 level; if each 128-byte L3 cache line is from a different page, that's 768k pages for the 96MB L3 cache.

That being said, huge pages won't help much in this degenerate case. So your assumption might be valid for this argument actually.

So maybe it's not that much of an error.


My estimate is for a small number of contiguous regions. It is true that if you adversarially construct a set of cache lines, you might need a far larger amount of memory to store page tables for them. Whether you consider that an "error" or just a simplifying assumption is a matter of opinion I suppose


PDEP/PEXT were part of the Intel Haswell microarchitecture, launched in 2013.

And yes, they can be extremely useful for efficient join operations in some contexts, that would be challenging to implement without those instructions. Also selection for some codes. Not everyone needs them, but when you need them you really need them. And those use cases are frequently worth it. I use them to implement a general algebra, much like you suggest.


Spectre. It was a vulnerability before 2015, but not known publicly until early 2018. It's hugely disruptive to microarchitecture, particularly with crossing kernel/user space boundaries, separating state between hyperthreads, etc.


100%. Feasible timing attacks mean we must look at all speculative execution with suspicion. But dang that's a lot of performance to give up.


big.LITTLE-like architectures? Even Intel has adopted that in their 12th gen.

I believe a lot has happened around mobile and power as well. Apple boasts about their progress every year, and at least some of it is real. But they are too secretive to talk about the details. I hope some competitors have written related papers. For example, the OP talks about dark silicon. What's going on around that these days?


Intel PT is another thing that's worth calling out since 2015 (see the other article on the front page right now, https://news.ycombinator.com/item?id=31121319, for something that benefits from it).

It does look like Hardware Lock Elision / Transactional Memory is something that will be consigned to the dustbin of history (again).


Intel did not ship even one working implementation of TSX, so it's not like anyone will be inconvenienced that they cancelled it.


A number of companies invested in doing the software development to take advantage of TSX (as the performance improvements helped databases by companies like Oracle), so Intel certainly lost a lot of credibility. And Intel is jerking software developers around again with the latest vector instructions that keep getting turned off in desktop / laptop SKUs and only being available on servers. Intel has done quite poorly over the past 5+ years on this front.


The section on power really understates the complexity. Throttling didn't appear until the mid-90s, as coarse chip-wide clock gating. Voltage/frequency scaling appeared a few years later (gradual P-state transitions). Then power control units monitored key activity signals and could not only scale the voltage, but estimate power and target specific blocks (e.g., turning off the L1 D$).

There are some more details in there but that's the main gist. The power control unit is its own operating system!


> Even though incl is a single instruction, it's not guaranteed to be atomic. Internally, incl is implemented as a load followed by an add followed by a store.

I've heard the joke that RISC won because modern CISC processors are just complex interfaces on top of RISC cores, but there really is some truth to that, isn't there?
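
As an aside, a minimal sketch of the quoted point in C++ terms (my own illustration, not from the article): to get an atomic increment you have to ask for one, which on x86 means a lock-prefixed read-modify-write.

    #include <atomic>

    int              plain_counter  = 0;   // ++plain_counter is load/add/store: updates can be lost
    std::atomic<int> atomic_counter{0};    // ++atomic_counter is an atomic read-modify-write

    void bump() {
        ++plain_counter;    // what the quote describes: not atomic across threads
        ++atomic_counter;   // on x86 this compiles to a lock-prefixed add/inc
    }

    int main() { bump(); }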


Eh, it's overstated. Classic CISC tended towards microcode because it allowed complex operations in a way that didn't consume main data-memory bandwidth. RISC, in a lot of ways, was the recognition that once you can assume ubiquitous instruction caches for your ISA, you can let your asm writers just write what was previously microcode. Hot loops will be cached and out of the way of data bandwidth, but now there are no arbitrary limits on microcode program size.

So RISC was a revolution over CISC, but CISC arguably always had 'a RISC core' inside, even before RISC was a thing.


I'd argue the causality goes the other way. Microcode allows you to do complex things on top of a very minimal execution engine. You might use the same ALU for an add as you would for a multistage multiply as you would for a computed jump, etc...

But as CPUs got bigger and suddenly had enough hardware to do "everything" at once, they had the problem of circuit depth. Sure, you can execute the whole instruction in one cycle but it's a really LONG cycle.

You fix that problem with pipelining ([R]eally [I]nvented by [S]eymour [C]ray, of course).

But you can't pipeline a complicated microcoded instruction set. Everything that happens has to fit in the same pipeline stages. So, the instruction set naturally becomes "reduced".

Basically: RISC is the natural choice once VLSI gets rolling. It's not about simplification at all, it's about exploiting all the transistors on much more "complicated" chips.


Except older CPUs already sequenced heavily within an instruction. Even more than you might think. The Z80, for instance, only had a 4-bit ALU and would pump it multiple times to get the bit width required. Early 808x chips averaged 5 or so cycles per instruction. Internally, though, their microcode typically issued once per cycle.

> But you can't pipeline a complicated microcoded instruction set. Everything that happens has to fit in the same pipeline stages. So, the instruction set naturally becomes "reduced".

That's what they said in the early 80s, but then the 486 came out. AFAIK the longest-pipelined general-purpose designs were also fairly heavily microcoded (NetBurst).


The 80486 was only minimally pipelined, and in fact if you squint it fits what would later become the standard model: an "expansion" engine at the decode level emits code for the later stages (which look more like a RISC pipeline, with separate cache/execute/commit stages). That engine is still microcoded (because VLSI might have been rolling, but there's no way you could do a uop cache in 2M transistors), and still limited to multicycle instruction execution for all but the simplest forms.

Basically, if you were handed the transistor budget of an 80486 and told to design an ISA, you'd never have picked a microcoded architecture with a bunch of addressing modes. People who did RISC in those chip sizes (MIPS R4000, say) were beating Intel by 2-3x on routine benchmarks.

Again: it was the budget that informed the choice. Chips were bigger and people had to figure out how to make 2M transistors work in tandem. And obviously when chips got MUCH bigger it stopped being a problem because dynamic ISA conversion becomes an obvious choice when you have 200M transistors.


RISC does not resemble microcode all that much, VLIW is a lot closer to the mark. VLIW is also heavily μarch-dependent, much like microcode.


There are two kinds of microcode, vertical and horizontal. RISC resembles vertical microcode that can expand to horizontal-microcode-looking uops in a single decode stage.

Having coded all three (vertical ucode, horizontal ucode, and RISC decode-stage HDL), the biggest difference I've found is that RISC tends to be three-address, and CISC vertical microcode tends to be two-address. I think that comes from the same place as assuming the presence of an I$. The gate-count niche that lets you have a large SRAM block for a cache also lets you have a large SRAM block for your register file. You're therefore not as dependent on a single accumulator (or two) tightly coupled with your ALU, like earlier CISC designs had to encourage.


Three-address designs trade off a larger insn word for fewer MOV insns and a more straightforward use of the register file. ("Compressed" insns are often two-reg special cases of a three-reg insn.) It's true that RISC tends to feature a higher amount of fully-general registers (as opposed to e.g. separate "data" and "address" registers) but that seems to be a separate development, possibly relating to the load-store approach.


They're all related. When you have an accumulator-based design with your ALU and registers, that necessitates a two-address format. Or else you're not accumulating. If you don't have space next to your ALU for more than one register, you're stuck with an accumulator.


> but there really is some truth to that, isn't there?

There was an Arm version of AMD's Zen, it was called K12, but it never made it to the market since AMD had to choose their bets very carefully back then.


Isn’t AMD still hamstrung by their size relative to their competitors? At least in different ways than 5 years ago. The latest examples of AMD having to choose carefully include deciding what products to focus on with their agreed-upon capacity at TSMC last year: more Zen chips to make gains against Intel, more GPUs to gain market share against Nvidia, or dedicating capacity to console APUs for Sony and Microsoft. It also seems to be the case with their attempt at a budget GPU, where they settled for strapping laptop GPUs onto desktop PCI cards and suffered heavy derision from gamers. As I understand from people familiar with the way chips are designed, a new die design can cost tens of millions of dollars, and AMD, unlike Intel and Nvidia, is still too small to just crank out new costly designs at will.


Chip design cycles are long enough that you could argue they're hamstrung by their size 3 years ago.


Some but not the whole truth.

It's a sort of RISC core, but count the number of micro-ops some instructions translate to; some of them are pretty chunky.


The short answer to David Albert's original question is much simpler than the article ... very little is actually new. We have lots of features that have migrated from "big" computers into PCs and SoCs, but very little that a big-machine programmer from 40-50 years ago wouldn't recognize.

Multi-level caches, TLBs, out-of-order execution, NUMA memory, SIMD, multi-threading - were all normal features of big machines (Crays, Cybers, etc) and are now normal on single chip machines. We have various optimizations and special case handling (eg branch prediction) but these are not normally directly visible to programmers.

GPUs (very large specialized coprocessors) were not a feature of older computers; and their capacity now routinely dwarfs the "main" CPU's.

At a lower level, the routine use of high speed serial comms is new-ish, or at least a reinvention. It was used in very early computers, but use of serial comms for PCIe devices and SD cards is now routine. Perhaps it's just an optimization, but it was not usual a few decades ago.

One thing that I don't recall seeing on older computers is the multitude of fine-grained performance counters now available - useful for fine-tuning, and for peeking at what's going on outside your own process!!


my favourite addition since the 80s has been the unrelenting, unquestioned, ubiquitous and permanent inclusion of numerous iterations of poorly planned and executed management engine frameworks designed to completely ablate the user from the experience of general computing in the service of perpetual and mandatory DRM and security theatricality. the best aspect of this new feature is that not only is your processor effectively indistinguishable from a rented pressure washer, but on a long enough timeline the indelible omniscient slavemaster to which your IO is subservient can and will always find itself hacked. One of the biggest features missing from 80s processors was the ability to watch a C level cog from a multinational conglomerate squirm in an overstarched tom ford three piece as tech journalists methodically connect the dots between a corpulent scumbag and a hastily constructed excuse to hobble user freedoms at the behest of the next minions movie to arrive finally at a conclusion that takes said chipmaker's stock to the barber.

oh and chips come with light-up fans and crap now but there's no open standard on how to control the light color so everyone just leaves it in Liberace disco mode so it's like a wonderful little rainbow coloured pride parade is marching through my case.


Is that a GPT3 generated text ? It must be.


For once I think it's just an honest to god drug addled heartfelt rant.


I'm hearing an impassioned cry from someone who is devastated by the sequence of events that led to modern computing being a system to control the masses instead of a system where the masses were free.

It used to be that you could do almost anything with impunity on the internet, trade books and mp3's, bitch on forum posts anonymously and anything less than the CIA getting a federal warrant (which they never did because even if they could see what you were doing it was obvious you were just being an edgy teen) would mean that your bullshit would be hidden and lost to the annals of time.

Now if you post something edgy on Reddit and someone takes offense to it then they can track down your twitter and insta and fb and swat your house and get you fired and turn you into a social pariah.

Hopefully this is obvious hyperbole, just saying that I get where they're coming from. From before the Matrix made the internet seem like it was cool and full of vigilante Jesus bullet-dodging ninjas with infinite guns and everyone wore black leather and hacked into the mainframe. There was a time when the internet was mostly text and pretty much sucked but that was still amazing and beautiful and now it's an all out slugfest between 5 billion people all bickering and posing for points, fame and fortune from the other 5 billion humans on the same 5 websites while corporations spur the up and comers into greater feats of dickishness in order to garner attention that can then be used to possibly sell some random cream made by child labor in Morocco to fight off a butt wrinkle or something stupid like that that doesn't matter.

The internet is an amazing thing and computers are amazing but this ship did not end up where its original captains were sailing it and no one knows if we could ever turn back.


Reading this feels like I'm reading a book in a dream


>oh and chips come with light up fans and crap now but theres no open standard on how to control the light color so everyone just leaves it in Liberace disco mode

This is why I continue to pay a premium for workstation/server grade hardware even when I'm assembling the system myself.


You are preaching to the choir. People don't care about security. They just want to see their Netflix 320x240 px movie marketed as HD.


check out coolero, it's foss for controlling cpu coolers that can handle lighting for a lot of them https://gitlab.com/codifryed/coolero


Does it avoid setting off Epic’s garbage EAC software?

Most of the other lighting software sets it off because for whatever reason, they appear to do something with running processes or windows registry or something that causes EAC to panic and crash my computer. The default ASUS motherboard lighting controls set it off because apparently they aggressively reach into the running game binary in order to set the lighting appropriately.


I love this small hn moment where somebody complains about

A) threats to freedom and privacy of computing

B) that it is difficult to control the lights of cpu coolers

And there will be very helpful responses fixing B).


You will love it more later when you see that this is a "think of the children" moment. First they control the CPU, then the OS, and then they control you.


And here I was led to believe OpenRGB was consolidating all of these FOSS attempts to get all RGB working in one application


this has a beautiful stream of consciousness quality to it. do you write somewhere?

agreed on all points too


Are you alright


He's pining for the freedoms.

For a while it seemed that senator-level freedom would be available to the great unwashed, but it was but a mirage.

Carry on.


Parklife!


One thing I've always wondered: How do pre-compiled binaries take advantage of new instructions, if they do at all? Since the compiler needs to create a binary that will work on any modern-ish machine, is there a way to use new instructions without breaking compatibility for older CPUs?


Some compilers have dynamic dispatch for this; you run the "cpuid" instruction and check for capability bits, and then dispatch to the version the CPU can support. Some dynamic linkers can even link different versions of a function depending on the CPU capabilities -- gcc has a special __attribute__((target("..."))) attribute for this.


GCC's method is called ifunc https://sourceware.org/glibc/wiki/GNU_IFUNC

The upside is your program may get updated, accelerated routines if it is dynamically linked to a library that you update. The downside is the calls are always indirect via the PLT, which isn't very efficient. It is suitable for things like block encryption and compression where the function entry latency is not very large compared to how long the function runs. It is not very suitable for calls that may be extremely short, like memcmp.
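
A hedged sketch of what that looks like with GCC's C++ function multi-versioning (which is built on ifunc under the hood; the function name is made up, and this is g++/x86-only):

    #include <cstdio>

    // The resolver that GCC generates picks one of these at load time
    // based on CPUID, so callers just call vector_width_bytes().
    __attribute__((target("default")))
    int vector_width_bytes() { return 16; }   // baseline SSE2 path

    __attribute__((target("avx2")))
    int vector_width_bytes() { return 32; }   // chosen automatically on AVX2 CPUs

    int main() {
        std::printf("using the %d-byte path\n", vector_width_bytes());
    }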


This is where dynamic linking can play a role. For example, Apple's Accelerate framework uses some undocumented matrix instructions on the M1. If you dynamically link the framework (the only supported linkage) you'll automatically get some benefit from these instructions, and future ones, even if those instructions did not exist when you compiled your app.


You’d need to recompile the binary to take advantage of new instructions.

The compiler, but also the code itself, can create branches where the binary checks whether certain instructions are available and, if they are not, uses a less optimal path.

Backwards compatibility for modern binaries, basically. But not the forward ability to use future instructions that haven't been invented yet.

Not all binaries are fully backwards compatible. If you're missing AVX, a surprising number of games won't run. Sometimes only because the launcher won't run, even though the game itself plays fine without AVX.
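
A minimal sketch of that check-and-branch pattern (my own illustration; the builtin is GCC/Clang-specific and the function names are made up; both paths are plain C++ just to keep it self-contained, a real fast path would use intrinsics):

    #include <cstdio>

    static int sum_fast(const int* v, int n)  { int s = 0; for (int i = 0; i < n; ++i) s += v[i]; return s; }
    static int sum_plain(const int* v, int n) { int s = 0; for (int i = 0; i < n; ++i) s += v[i]; return s; }

    int main() {
        int v[4] = {1, 2, 3, 4};
        // __builtin_cpu_supports consults CPUID-derived data at runtime.
        auto sum = __builtin_cpu_supports("avx") ? sum_fast : sum_plain;
        std::printf("%d\n", sum(v, 4));
    }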


I've actually sometimes seen this used as an argument in favor of JITed languages like C# and Java: you can take advantage of newer CPU features and instructions without having to recompile. In practice, languages that compile to native binaries still win on performance, but it was interesting to see it turned into a talking point.


JIT languages still have a bit of a trade-off.

But for a purely pre-compiled example there is Apple Bitcode, which is meant to be compiled for the destination architecture before it runs (as opposed to JIT code, which is compiled when run). It's mandatory for Apple watchOS apps, and when they released a watch with a 64-bit CPU they just recompiled all the apps.


I believe that the binaries that actually get shipped to the watch are final bits. Apple just compiles the bitcode that developers give them into versions for all of the watches that they are supporting, and those versions get downloaded by the actual watches.

If they come out with new watches, then they can re-compile all of the code for the new watches with no developer involvement. It is really the best solution for all involved.


D has a thing where you mark a function as @dynamicCompile, so at the expense of carting the IR and a compiler around you can then use whatever instructions the compiler can detect (i.e. it can be compiled on first use)


Generally they don't.


Unfortunately, the answer is usually just to recompile.


I clicked on this link without really reading the title. Then about halfway through I started getting this really exciting feeling, like wow, who is the person who's writing this technical poetry? Then I looked up at the omnibar and, of course, it's Dan Luu.


Nice article. Kind of low-hanging fruit though. A comparison between CPUs in 2022 vs CPUs in 2002 would be much more interesting. ;)


Not really low hanging fruit if the last time you studied CPU design was in a university course. I personally found a lot of the information pretty interesting.


The vast expanses of text with no formatting rules always make it hard for me to follow along. Added some simple rules that make it much easier to read.

p { max-width: 1000px; text-align: center; margin-left: auto; margin-right: auto }

body { text-align: center }

MUCH easier (code snippets = toast)


What about reader mode?


> With substantial hardware effort, Google was able to avoid interference, but additional isolation features could allow this to be done at higher efficiency with less effort.

This is surprising to me. Running any workload that's not your own will trash the CPU caches and make your workload slower.

Consider for example that your performance-sensitive code has nothing to do for the next 500 microseconds. If the core runs some other best-effort work, it will trash the CPU caches, so that after those 500 microseconds, even when that other work is immediately preempted by the kernel, your performance-sensitive code is dealing with a cold cache.


Google's Borg traces contain both cycles per instruction and last-level cache misses for every process in the traced cluster, so if you are interested you can go dive into that.

https://github.com/google/cluster-data


Off the top of my head (before reading the article):

caches, pipelining, branch prediction, memory protection, SIMD, floating point at all, hyper-threading, multi-core, needing cooling fins, let alone active cooling

I wonder how much I've forgotten


Most of that (possibly all of it) existed by the 1980s. The Z80 in my Spectrum had no heatsink ;)


The last processor designed by hand was Alpha.


All of those already existed by the 80s.


The US patent for the technology behind hyper-threading was granted to Kenneth Okin at Sun Microsystems in November 1994.


I don't want to dismiss hyper-threading as trite — it's not, especially in implementation, but it is pretty obvious.

Prior to 1994 the CPU-memory speed delta wasn't so bad that you needed to cover for stalled execution units constantly. Looking at the core clock vs FSB of 1994 Intel chips is a great throwback! [1] Then CPU speed exploded relative to memory, as was probably anticipated by forward looking CPU architects in 1994.

With slow memory there are a few obvious changes you make, to the degree you need to cover for load stalls: 1) OoO execution, 2) data prefetching, 3) finding other computation (that likely has its own memory stalls) to interleave. The thread level is a pretty obvious grain at which to interleave work, even if it's deeply non-trivial to actually implement.

Performance oriented programmers have always had to think about memory access patterns. Not new since the 80s to need to be friendly to your architecture there.

[1] https://en.wikipedia.org/wiki/Pentium#Pentium


CDC 6600 ran ten threads on one processor in a way that seems a lot like the Niagara T1 on paper.


I think the main difference is that on the CDC 6600 the PP state barrel would rotate in a regular way, constantly multiplexing the execution of the 10 virtual PPs onto the same hardware.

Hyper-threading is the idea that the multiplexing of multiple threads happens dynamically, based on data dependencies / stalls.


Fundamentally they are the same SMT concept though: multiple hardware threads that have separate register (and other) state but share an execution unit.

HT as an implementation tries to give long sequential execution windows to a hardware thread; it's designed to hide occasional long latencies (typically IO), but pays with a higher switching cost.

A barrel processor trades off individual instruction latency for a higher degree of parallelism and improved latency hiding.


And between the two was HEP, by Burton Smith.


The MTA architecture developed by Tera Computing came several years earlier (later acquired by Cray). It was arguably a purer expression of the concept. And barrel processing concurrency mechanics can be found on much older architectures.


Other way around: Tera acquired Cray and changed to its name.


https://en.wikipedia.org/wiki/Tomasulo_algorithm was invented in the 1960s, although it did take a while for OOO to make it into the consumer market.


The riddance of segmented memory.


The alternative wasn't that great either. Having just 16 address bits, with data memory and code memory stuck into the same 64 kB RAM area, is a lot worse (the 8051 was like that, for example). Or 65816-style banks, ugh.

If you had to have just 16 address bits, having code (CS), stack (SS), data (DS), extra (ES), etc. segments was actually pretty nice. Memory copying and scanning operations were natural without needing to swap banks in the innermost loop.

Of course if you could afford 32-bit addressing, there's no comparison. Flat memory space is the best option, but I don't think it came for free.


The segmented memory may come back to provide a cheap way to sandbox code within a process.


Have you heard of this newfangled device called a Graphics Processing Unit and "VRAM"?


That's not "segmented memory".

Segmented memory was a 16-bit processor trying to reach an address space bigger than 16 bits can cover: on the 8086, a 20-bit (1 MB) space (good grief, I forgot the details).

You'd have "segment registers", such as "ds" or "es". Reading/writing to memory was by default to the "ds" segment (a 64kB region). If you wanted to read/write to a different segment, you needed to make sure that es, or fs, or one of the other segment registers were available.

If you look at old Win16 code, you might see "far pointers" and "near pointers". "Far pointers" were 32-bit pointers that would load the appropriate segment register, while "near pointers" were 16-bit pointers that only made sense within certain contexts.

It was a nightmare. "Flat Mode" eventually became the norm when processors had enough space to just store full sized 32-bit pointers (and later, 64-bit pointers). "Flat mode" made everything easier and now we can forget that those dark ages ever existed...
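
For anyone who never had the pleasure, the real-mode arithmetic was just a shift and an add; a small sketch of my own:

    #include <cstdint>
    #include <cstdio>

    int main() {
        // 8086 real mode: a 16-bit segment and a 16-bit offset combine into a
        // 20-bit physical address. Different segment:offset pairs can alias
        // the same physical byte, which is part of what made far pointers painful.
        std::uint16_t seg = 0xB800, off = 0x0010;               // e.g. the text-mode video segment
        std::uint32_t linear = (std::uint32_t(seg) << 4) + off; // 0xB8010
        std::printf("linear = 0x%05X\n", (unsigned)linear);
    }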


Yeah, but that's GPU, not CPU. Hopefully we will see similar progress there in the next 40 years.


I'm going through Understanding Software Dynamics by Richard Sites now, and it's the first book I've read that covers the practical performance implications of some of these new features, even if briefly.


What about wide issue / superscalar, deep pipelining, speculation, multicore (only mentioned briefly), load/store (RISC) architectures (including the internal implementations of x86), heterogeneous designs, SoC designs, and various types of accelerators in addition to GPUs (media accelerators, ML accelerators, smart NICs...)?

Though probably few of those things are technically new since the 1980s.


I hope they invent the CSS instruction in the next gen CPU so that Dan can start using it in his blog.


2015? I think there’s a date missing on the page.


Near the end, in one section, the author provides an update and refers to it being 2016, a year since they wrote the article.


Most CPUs aren’t based on the 8086? From phones and Chromebooks, to the cloud, and up to the biggest supercomputers: ARM is there.


8086 is the USA of computers.


"My first computer was a 286. On that machine, a memory access might take a few cycles. A few years back, I used a Pentium 4 system where a memory access took more than 400 cycles. Processors have sped up a lot more than memory"

I remember seeing an ad for POWER which advertised memory speed at half the processor speed. I think only consumer CPUs have slow memories.


My Ryzen 9 is just a bit more performant than the MOS6502 my Apple ][ was rocking back in the 80s.

Ok... it also has things like a built-in FPU, multiple cores, cache, pipelining, branch prediction, more address space, more registers, and it's manufactured with a significantly better (and smaller) process.


Yes, but look at how fast it runs the antivirus /s




