2015. A good exercise would be "What's new in CPUs since 2015?" A few I can thin...

dragontamer · on April 22, 2022

L3 caches have grown monstrously.

The new AMD Ryzen 5800x3d has 96MB of L3 cache. This is so monstrous that the 2048x entry TLB with 4kB pages only can access 8MB.

That's right, you run out of TLB-entries before you run out of L3 cache these days. (Or you start using hugepages damn it)

----------

I think Intel's PEXT and PDEP was introduced around 2015-era. But AMD chips now execute PEXT / PDEP quickly, so its now feasible to use it on most people's modern systems (assuming Zen3 or a 2015+ era Intel CPU). Obviously those instructions don't exist in ARM / POWER9 world, but they're really fun to experiment with.

PEXT / PDEP are effectively bitwise-gather and bitwise-scatter instructions, and can be used to perform extremely fast and arbitrary bit-permutations. I played around with them to implement some relational-database operations (join, select, etc. etc.) over bit-relations for the 4-coloring theorem. (Just a toy to amuse myself with. A 16-bit bitset of "0001_1111_0000_0000" means "(Var1 == Color4 and Var2==Color1) or (Var2==Color2)".

There's probably some kind of tight relational algebra / automatic logic proving / binary decision diagram / stuffs that you can do with PEXT/PDEP. It really seems like an unexplored field.

----

EDIT: Oh, another big one. ARMv8 and POWER9 standardized upon the C++11 memory model of acquire-release. This was inevitable because Java and C++ standardized upon the memory model in the 00s / early 10s, so chips inevitably would be tailored for that model.

seoaeu · on April 22, 2022

> That's right, you run out of TLB-entries before you run out of L3 cache these days.

This is more reasonable than it sounds. A TLB miss can in many cases be faster than a L3 cache hit

jeffbee · on April 22, 2022

It's also misleading because it has 8 cores and each of them has 2048 l2 TLB entries. Altogether they can cover 64MiB of memory with small pages.

dragontamer · on April 22, 2022

But 5800x3D has 96MB of L3. So even if all 8 cores are independently working on different memory addresses, you still can't cover all 96MB of L3 with the TLB.

EDIT: Well, unless you use 2MB hugepages of course.

jeffbee · on April 22, 2022

That's another thing which is recent. Before Haswell, x86 cores had almost no huge TLB entries. IvyBridge only had 32 in 2MiB mode, compared to 64 + 512 in 4KiB mode.

dragontamer · on April 22, 2022

Are you sure? TLB misses mean a pagewalk. Sure, the directory tree is probably in L3 cache, but repeatedly pagewalking through L3 to find a memory address is going to be slower than just fetching it from the in core TLB.

I know that modern cores have dedicated page walking units these days, but I admit that I've never tested the speed of them.

seoaeu · on April 22, 2022

It only takes ~200KB to store page tables for 96MB of address space. So the page table entries might mostly stay in the L1 and L2 caches

dragontamer · on April 23, 2022

I think you made an error in your assumptions.

Each 64byte cache line could feasibly come from a different page in the worst case.

I think modern processors actually pull 128 bytes from RAM at the L3 level, if each 128 L3 cache line is from a different page, that's 768k pages in the 96MB L3 cache.

That being said, huge pages won't help much in this degenerate case. So your assumption might be valid for this argument actually.

So maybe it's not that much of an error.

seoaeu · on April 23, 2022

My estimate is for a small number of contiguous regions. It is true that if you adversarially construct a set of cache lines, you might need a far larger amount of memory to store page tables for them. Whether you consider that an "error" or just a simplifying assumption is a matter of opinion I suppose

jandrewrogers · on April 23, 2022

PDEP/PEXT were part of the Intel Haswell microarchitecture, launched in 2013.

And yes, they can be extremely useful for efficient join operations in some contexts, that would be challenging to implement without those instructions. Also selection for some codes. Not everyone needs them, but when you need them you really need them. And those use cases are frequently worth it. I use them to implement a general algebra, much like you suggest.

titzer · on April 22, 2022

Spectre. It was a vulnerability before 2015, but not known publicly until early 2018. It's hugely disruptive to microarchitecture, particularly with crossing kernel/user space boundaries, separating state between hyperthreads, etc.

code_biologist · on April 23, 2022

100%. Feasible timing attacks mean we must look at all speculative execution with suspicion. But dang that's a lot of performance to give up.

flakiness · on April 22, 2022

Big.Little-like architecture? Even intel has adopted that in their 12 gen.

I believe a lot has happened around mobile and power as well. Apple boasts their progress every year, and at least some of them are real. But they are too secretive to talk about that. I hope some competitors have written some related papers. For example, the OP talks about dark silicon. What's going on around it these days?

jcranmer · on April 22, 2022

Intel PT is another thing that's worth calling out since 2015 (see the other article on the front page right now, https://news.ycombinator.com/item?id=31121319, for something that benefits from it).

It does look like Hardware Lock Elision/Transactional Memory is something that seems like it will be consigned to the dustbins of history (again).

jeffbee · on April 22, 2022

Intel did not ship even one working implementation of TSX, so it's not like anyone will be inconvenienced that they cancelled it.

bcrl · on April 23, 2022

A number of companies invested in doing the software development to take advantage of TSX (as the performance improvements helped databases by companies like Oracle), so Intel certainly lost a lot of credibility. And Intel is jerking software developers around again with the latest vector instructions that keep getting turned off in desktop / laptop SKUs and only being available on servers. Intel has done quite poorly over the past 5+ years on this front.