AMD and JEDEC Are Collaborating on DDR5 MRDIMMs at 17,600 MT/s (hothardware.com)
144 points by rbanffy on April 6, 2023 | 160 comments



A shower-thought that popped into my brain this morning is that with ~100ns latency to main memory, that's the equivalent of "only" 10 million IOPS (per channel). I say only, because my laptop has an NVMe SSD that can do nearly 1 million IOPS.

If you have 96 CPU cores per socket, and only 6 channels per socket, that's a "mere" 625K IOPS per core. Up that to 128 cores (coming soon), then it's just 470K IOPS per core! That's worse in some sense than a laptop SSD for a single-threaded program. Not directly comparable, of course, but you can see why AMD would want to bump this number up.

For comparison, the equivalent IOPS for L2 cache is a blistering 260 million IOPS, and L1 cache is 850 million.

I see now why the phrase "memory is the new disk" is starting to get popular.


> A shower-thought that popped into my brain this morning is that with ~100ns latency to main memory, that's the equivalent of "only" 10 million IOPS (per channel). I say only, because my laptop has an NVMe SSD that can do nearly 1 million IOPS.

That is not how any of this works. Neither RAM nor SSD can get anywhere near their peak throughput if accesses are serialized. If your laptop has DDR5, it can likely do > 1B iops. 100ns is the time to get a single random access to a closed row, but while you are waiting for that access to happen, you can issue new operations every 3ns to different banks and bank groups on the same channel.


I don't know how any of this works, and there seem to be a number of experts in this thread so please allow me some very newbie questions.

> [can't] get anywhere near their peak throughput if accesses are serialized

But first of all, doesn't DRAM have an optimisation for serial accesses, which is (I hope I get this the right way round) to hold the RAS steady then do CAS/CAS/CAS/CAS instead of RAS/CAS + RAS/CAS + RAS/CAS + RAS/CAS?

In addition, if you do access RAM serially then there is some interleaving which is automatically inserted so that sequential accesses at some level are sent off to different banks (but I've never been clear if they're interleaved at 64 bytes, to fit a cache line, or at a 4K page size to suit the TLB, or some other size, and I'd really like to know – anyone please?)


Sorry, two entirely different definitions of serial, and mine is less useful.

What I mean is that if you are doing an infinite pointer hop (that is, load a value, then use that value as a pointer to load another value, use that as a pointer... etc), you only get a tiny fraction of the IOPS you would get if you launched 100 different loads to different addresses that you already know. And the inverse of your load-to-use latency is basically how many IOPS you get if you do that infinite pointer chase.
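
To make that concrete, here's a rough C sketch of the two patterns (array size, iteration count and the shuffle are arbitrary choices; the absolute numbers will vary a lot by CPU and DRAM configuration):

    /* Dependent (pointer-chasing) vs. independent loads. Build: cc -O2 chase.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N (1u << 24)                /* 16M entries = 128 MB, far bigger than cache */

    static double now(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        size_t *next = malloc(N * sizeof *next);
        for (size_t i = 0; i < N; i++) next[i] = i;
        /* Sattolo shuffle: one big random cycle, so the chase visits every slot. */
        for (size_t i = N - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i, t = next[i];
            next[i] = next[j]; next[j] = t;
        }

        double t0 = now();              /* 1) each load's address depends on the last */
        size_t p = 0;
        for (size_t i = 0; i < N; i++) p = next[p];
        double dep = now() - t0;

        t0 = now();                     /* 2) random addresses, but known up front, so
                                              many loads can be in flight at once */
        size_t sum = 0;
        for (size_t i = 0; i < N; i++) sum += next[next[i]];
        double ind = now() - t0;

        printf("dependent:   %.1f ns per load\n", dep * 1e9 / N);
        printf("independent: %.1f ns per load   (checksums: %zu %zu)\n",
               ind * 1e9 / N, p, sum);
        return 0;
    }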

> In addition, if you do access RAM serially then there is some interleaving which is automatically inserted so that sequential accesses at some level are sent off to different banks (but I've never been clear if they're interleaved at 64 bytes, to fit a cache line, or at a 4K page size to suit the TLB, or some other size, and I'd really like to know – anyone please?)

They are interleaved at the size of the DRAM row, which is usually 2^16 bits, or 8192 bytes. The reasoning here is that if you are launching a ton of linear accesses, that is, doing CAS/CAS/CAS/CAS... to the same row, you can get full throughput from it. This is because when you have opened a row, you are not reading from the DRAM anymore, the entire row is in a SRAM buffer, and can read full interface throughput from it. Then you only have to open a new row once the current one runs out, which is always in a different bank so can be done in parallel with the read operations from the current row so long as your memory prefetchers are smart enough to start doing it early enough.


Thanks that's really good, and the interleaving question's been nagging me a long time. There's a lot of depth here when you start asking. Much appreciated.


Yeah, DRAM controllers optimize for serial access in the mapping from physical address space to DRAM DIMMs, banks, rows, and columns. There's actually a pretty sophisticated algorithm for mapping "physical" addresses to DRAM addresses.
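
To make the idea concrete, here's a toy version of such a mapping (the bit-field widths are made up for an 8 KB-row part with 4 bank groups x 4 banks, not any real controller's layout; real controllers usually hash these bits as well):

    /* Toy physical-address -> DRAM-coordinate split. Field widths are
       illustrative assumptions only, not any real controller's mapping. */
    #include <stdio.h>
    #include <stdint.h>

    struct dram_addr { uint64_t row; unsigned bank_group, bank, column; };

    static struct dram_addr map(uint64_t phys) {
        struct dram_addr d;
        d.column     = phys & 0x1FFF;        /* low 13 bits: offset in an 8192-byte row */
        d.bank_group = (phys >> 13) & 0x3;   /* next 2 bits                             */
        d.bank       = (phys >> 15) & 0x3;   /* next 2 bits                             */
        d.row        = phys >> 17;           /* everything above selects the row        */
        return d;
    }

    int main(void) {
        /* Consecutive 8 KB chunks land in different bank groups, so a linear
           stream keeps several rows open in parallel. */
        for (uint64_t a = 0; a < 4 * 8192; a += 8192) {
            struct dram_addr d = map(a);
            printf("phys 0x%06llx -> row %llu  bg %u  bank %u  col %u\n",
                   (unsigned long long)a, (unsigned long long)d.row,
                   d.bank_group, d.bank, d.column);
        }
        return 0;
    }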


I am curious how this all maps onto software. Is it the kernel which will try to do many parallel RAM reads? What is the state of it, and are there any benchmarks?


It's done almost entirely in the CPU's memory controller. The operating system only controls the page table entries, which is a "high level" concept a layer of abstraction above this.


Your NVMe SSD gets 1 million IOPs with overlapping requests. That means each new request issued does not wait for the previous one to finish. There's a queue depth, and keeping it full enough is needed to reach 1 million IOPS throughput. The latency will be higher than 1µs for each request.

RAM also does overlapping requests and has a queue depth, so that's a reasonable comparison. The headline of this article says they are standardising RAM with a throughput of 17600 million IOPS.

If you have 96 CPU cores per socket, and 6 channels per socket going to separate MRDIMMs, that's 1100 million IOPS per core. At 128 cores, 825 million IOPS per core.

So, this new RAM standard is 825-1100 times faster throughput than your laptop SSD, when counted in IOPS (without regard to the data size).

If you want to compare latency instead of throughput, the IOPS measurement for that occurs when the sender waits for each request to complete before issuing the next one. In that case you're right that it will be much lower IOPS per CPU core, but this also makes it lower IOPS for the SSD, so that comparison still favours RAM.

Each CPU core is independent, so each core would likely wait for its own previous request before issuing the next one (this actually happens when a CPU core is following a linked list), but the cores are doing this independently of each other, in parallel.

So with 96 CPU cores per socket, each core waiting for its own request to complete before issuing the next one, and assuming ~100ns latency, that's 960 million IOPS total, and 10 million IOPS per core. With 128 CPU cores it's still 10 million IOPS per core.
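
The same arithmetic in code, in case anyone wants to play with the assumptions (the 100ns latency and the core counts are just the round figures from this thread):

    /* Serialized per-core IOPS estimate: one outstanding request per core. */
    #include <stdio.h>

    int main(void) {
        const double latency_ns = 100.0;            /* assumed load-to-use latency   */
        const double per_core   = 1e9 / latency_ns; /* = 10 million requests/s/core  */
        const int cores[] = { 96, 128 };
        for (int i = 0; i < 2; i++)
            printf("%3d cores x %.0fM IOPS/core = %.2f billion IOPS per socket\n",
                   cores[i], per_core / 1e6, cores[i] * per_core / 1e9);
        return 0;
    }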


Moreover, at least last time I played around with it on reasonable hardware (3rd gen EPYC), your first real limitation is your operating system's IO scheduler, not the hardware — at least on a "single node." At least a year and change ago, you started slamming into a wall around 15-16M random IOPS (from your backing store). Wasn't CPU bound. Wasn't memory bound or a NUMA issue. Wasn't an issue of PCIe lanes. 100% the IO scheduler, as dtrace very much laid that bare.

We’re very much in a situation where the software is catching up to the hardware. For example, look at the proposed NEST Scheduler for Linux where they were seeing 10-100% (yes, as much as double!) performance improvement with a 10-15% reduction in power consumption, mostly by focusing on keeping, “hot cores hot.”

IO scheduler and process scheduler improvements will offer truly material gains in coming years, even if you still use the same high-end hardware you have today.


I love the responses in this thread. Turned out that my "Fermi estimate" was wrong, by about a factor of 100x.

Simple back-of-the-napkin maths: a system with 30GB/s memory bandwidth and 32-byte cache lines works out to about a billion I/O operations per second. This would be a typical laptop system, or the like.

However, 100ns per read is still a valid scenario for a single-threaded program "chasing pointers" as in a linked-list.

I think it's still fair to say that an un-optimised single-threaded program accessing memory randomly is only 10x faster than a parallel program efficiently utilising a modern NVMe SSD.

The corollary is that there is (still) no scenario where an SSD would "beat" the performance of RAM since the worst-case for RAM still beats the best-case for SSD.


People don't seem to understand that IOPS is a meaningless measure from a practical perspective.

It is just a weird way to express throughput. An SSD that loads 4KB at a time has 64 times less IOPS than RAM that loads 64 bytes at a time assuming both have the same bandwidth.


The average fetch is not that big. IOPS and GB/s are both very important numbers.

From a practical perspective, the IOPS of RAM is critical for running programs. And if you're doing anything other than big file transfers, an SSD plugged into USB 2.0 at 35MB/s will beat a hard drive on USB 3.0 at 200MB/s.

Even better is IOPS at queue depth 1, IOPS at queue depth 16 or 32, and max bandwidth.


> with ~100ns latency to main memory, that's the equivalent of "only" 10 million IOPS (per channel)

I think you're mixing up latency and throughput. IOPS is throughput, and any single CPU core can have several RAM accesses in flight, not just one. So with DDR4-3200 you have more than 3 billion "IOPS" per channel, not 10 million as in your calculation.


Not quite so many. Each memory operation on modern CPUs is a minimum of 64 bytes, so on DDR4-3200 if you saturate a single 64-bit channel, you get 400 million IOPS.

On DDR5, channels are 32 bits wide, but there are two in a DIMM, so a normal desktop system is actually "quad-channel", with each channel doing up to that 400 million IOPS if the ram speed is DDR5-6400.

In practice, it's of course rare to completely saturate channels, just because of bank conflicts and refreshes, etc.
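
Back-of-the-envelope for that 400 million figure, as a sketch assuming 64-byte line fills and perfect channel utilization (which, as said, never actually happens):

    /* Peak 64-byte line fills per second for one DRAM channel. */
    #include <stdio.h>

    static void channel(const char *name, double mt_per_s, double bus_bytes) {
        double bw = mt_per_s * bus_bytes;                 /* bytes per second */
        printf("%-20s %5.1f GB/s -> %.0fM line fills/s\n",
               name, bw / 1e9, bw / 64 / 1e6);
    }

    int main(void) {
        channel("DDR4-3200, 64-bit:", 3200e6, 8);         /* 25.6 GB/s -> 400M */
        channel("DDR5-6400, 32-bit:", 6400e6, 4);         /* 25.6 GB/s -> 400M */
        return 0;
    }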


IOPS are not throughput at all... throughput can't even be calculated until you factor in the size of the I/O request.

You can have 1M IOPS at 8GB/sec if your blocks of data are small enough, like 8k... vs 100k IOPS at 100GB/sec throughput with 1MB request sizes...

In this case 100k IOPS are getting you more throughput than your 1M IOPS.


> IOPS are not throughput at all

Yes, it is. You're thinking of "bandwidth"; the terms "throughput" and "bandwidth" are not interchangeable. IOPS and bandwidth are two different forms of throughput: the former has number of operations in the numerator, the latter has bytes.


While it’s true that if you were e.g. traversing a random linked list doing pointer chasing, that might be true, it should be noted that the effective number of operations is significantly improved by the prefetcher. For example, on an i7 9700k, memcpy can copy at 13.7GB/sec, which is probably 400M operations in practice.
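
If anyone wants to reproduce that kind of number, a rough probe looks something like this (buffer size and repeat count are arbitrary; the line count treats each copied 64 bytes as one read plus one write):

    /* Rough memcpy bandwidth probe. Build: cc -O2 membw.c */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define SZ   (256u << 20)    /* 256 MB per buffer, well past any cache */
    #define REPS 10

    int main(void) {
        char *src = malloc(SZ), *dst = malloc(SZ);
        if (!src || !dst) return 1;
        memset(src, 1, SZ);      /* touch the pages so the copy measures DRAM, not faults */
        memset(dst, 2, SZ);

        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        for (int r = 0; r < REPS; r++) memcpy(dst, src, SZ);
        clock_gettime(CLOCK_MONOTONIC, &b);

        double s     = (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) * 1e-9;
        double bytes = (double)SZ * REPS;
        /* Each copied 64-byte line is one read plus one write transaction. */
        printf("copy rate %.1f GB/s (~%.0fM 64-byte lines/s moved) [%d]\n",
               bytes / s / 1e9, 2 * bytes / s / 64 / 1e6, dst[0]);
        return 0;
    }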


Today, compute speed is roughly 5-6 orders of magnitude faster (single threaded) than it was 30 years ago. Today, RAM is roughly 5-6 orders of magnitude faster than HDD was 30 years ago.

All the old-school HDD-optimization techniques from traditional database programming, etc. are now fully relevant again for memory management.


> Today, compute speed is roughly 5-6 orders of magnitude faster (single threaded) than it was 30 years ago

Wat? No it's not. It's closer to 5x-6x than it is to 5 orders of magnitude if you're looking at instructions the CPU normally executes.

> All the old-school HDD-optimization techniques from traditional database programming, etc. are now fully relevant again for memory management.

Those techniques were mainly managing the fact that it took 100s of milliseconds to random seek (like, human perceptible times to do one seek) so the algorithms tried hard to minimize that. It doesn't have much bearing on RAM with something like 100ns random access.


> It's closer to 5x-6x than it is to 5 orders of magnitude

1993 was when the original 60-66MHz Pentium first shipped; 5 orders of magnitude may be an exaggeration, but every step since then has had increases beyond just clock speed, and clock speed alone has increased nearly two orders of magnitude.


Memory latency went down a mere 2x between SDR and DDR around 2000, and then stayed flat for the last 20 years:

https://en.wikipedia.org/wiki/CAS_latency#Memory_timing_exam...


> Memory latency

The widening gap between memory latency vs speed of processing in the processor (and how that arguably makes access optimizations that used to make sense between processor and HDD sensible between processor and main memory) is the issue that was being discussed, so memory latency can't be counted as part of processor speed in that context.


And I still reckon that it's closer to 5-6x than it is to 5 or 6 orders of magnitude.


> And I still reckon that it's closer to 5-6x than it is to 5 or 6 orders of magnitude.

The geometric midpoint (the only one that makes sense here) between the high end of 5-6x and the low end of 5-6 orders of magnitude is about 774×.

The best info I can find (unsurprisingly, direct comparisons of performance of long-separated-in-time processors in aggregate is hard to find) is that between 1995-2011, integer performance increased somewhere well over 128 times but less than 256 times and floating point significantly more than 256x, and was trending in the last several years of the period to increase at a pace of 21% per year for both. [0] Seems likely to be close to or significantly past the 774× level between 1993-2023.

[0] https://preshing.com/20120208/a-look-back-at-single-threaded...


The access patterns are very different which makes old optimizations less useful.

For example, HDD access was essentially sequential, with the heads stacked on top of each other. Meanwhile modern memory has latency for a single request, but you can make several requests at the same time and latency doesn't stack linearly.


The comparison is about speed, not total capacity. Sure, memory hasn't scaled like that. But in terms of CPU stalling, it's effectively the same as disk was.


I edited the point a few times trying to get the idea across, but the point was one of access patterns.

Modern RAM allows parallel access where an HDD effectively only had a single read/write head. People also on average do a lot more computation on any given bit of memory.


> All the old-school HDD-optimization techniques

Kind of. With memory, at least, we don't need to do elevator seeks.


Each block of CPUs has its own controller so your math is already wrong.

But yeah, random access to memory is bad; you can only get to the max performance if each thread requests somewhat contiguous memory blocks


> Each block of CPUs has its own controller so your math is already wrong.

Wrong how? The internal structure doesn't matter very much. Channels per socket and cores per socket, to calculate channels per core, is the correct math.


NVME drives made for PCIe gen 4 and 5 use DRAM to get their high burst read and write rates, apart from optane devices. Most in gen 3 did as well, the 970pro being an exception.


I don't think this is accurate, because the memory protocol doesn't require you to wait for each IOP, it's much more complicated than that.


That's why there's parallel banks and parallel bank-groups.

Any _particular_ RAS + CAS to read a 64-byte burst will be ~100ns latency. But you can perform 16 of them in parallel on DDR4 (probably 32?? in parallel on DDR5?).

That is: Bank-group#0-Bank#0 RAS -> Bank-group#1-Bank#0 RAS -> ... Bank-group#4-Bank#0 RAS -> Bank-group#0-Bank#0 RAS ... Bank-group#4-Bank#4 RAS -> Bank-group#0-Bank#0 CAS (finally finish the 1st read/write)... etc. etc.

DDR4 sticks are designed to only be fully utilized when you have 16+ concurrent RAS/CAS operations going all at once.

-------------

If you have two sticks on a line of DDR4, you have a "rank" as well. So Rank#0-Bankgroup#0-Bank#0 -> Rank#1-Bankgroup#0-Bank#0... (etc. etc.) 32x in parallel (alternating between Stick#0 and Stick#1).

DDR4 sticks already can be multi-rank. So this MRDIMM seems to be another layer of parallelism (and DDR5 upgrade was probably another layer of parallelism again, because 100ns latency doesn't seem to be improved no matter how many years go by).


Note that DDR5 doubled the number of channels per module. That compensates for a lot of CPU growth, in addition to those big CPUs getting more channels over the last few years.


Intel currently has a max of 8 channels (for 60 cores) and AMD 12 (for 96 cores) on the server side.

Not to discount what you've said but it's slightly better :p


multiple socket multi-cpus solve this, right?


I know this for server CPUs... but imagine a GPU with these DIMMs for VRAM. It could hold a huge model without the headache of splitting it up.

That's basically what Intel Falcon Shores is, and I am sure AMD is cooking up something similar behind the MI300.


I'm a bit rusty on the details, but I think the general reason that GPUs have less memory than you want is because they are typically optimized for throughput, which limits capacity in a couple ways. One is higher power consumption, and another is low max distance between processors and memory.

We already have the ability to put far more memory in GPUs than they currently have, but we don't because it hurts throughput. I don't think technical advances will change the existence of a throughput/capacity tradeoff, but I do think it's an interesting idea that maybe DL models benefit from a different tradeoff than graphics applications. Personally I'd guess that's a larger factor here than incremental changes in capabilities, because the two types of memory are both improving at the same time. But who knows! Maybe the technologies will converge enough that it will no longer make sense to have two different variants, though I wouldn't bet on it.


Modern "consumer" GPUs have relatively narrow busses, which is a deliberate design choice. But its not about throughput, its about cost and power savings in gaming workloads. Old GPUs like the 290X had a 512 bit bus (like the M1 Max) and 16 GDDRX ICs.

Actual ML GPUs transitioned to HBM, but the capacity is relatively limited. I think Nvidia thought 96GB per GPU would be plenty when they were planning things out years ago, but the explosion of model sizes seems to have taken all the hardware manufacturers by surprise.


I have a VEGA 56 with HBM and a 2048 bit bus. It gets crushed by newer cards with far less bandwidth. It was great for crypto mining though.


Yeah I am not saying older cards were better, just that they were capable of holding much more VRAM with no silicon changes.


Yeah, i've been incredibly surprised that this hasn't popped up yet. It just seems like a completely logical next step.



Don't get too excited: unless your usecase is about pulling massive amounts of data organised in a linear fashion out of RAM in a single read, DDR5 is kind of crap for real world usage. Current DDR5-4800 kits (which are pretty much the only things that will work on non-enthusiast hardware) will usually perform worse than a regular DDR4-3200 at most tasks, for a single reason: latency. Most DDR5 these days have a latency of 20ns, compared to 10ns for DDR4. Any bandwidth gains will be destroyed by the fact that you're doing a thousand reads anyways.

However, if your usecase is sequentially reading massive chunks of memory, this will be useful for you. But dear god am I afraid of seeing the latencies on these beasts.


I am not sure that is accurate. I just looked at the inventory of my local computer parts store. They are selling Crucial DDR4-3200 DIMMs with CL22, and Crucial DDR5-4800 with CL40. DRAM latency is measured in clock cycles, so this difference is only +21%, not +100% as you suggested. Same for DDR5-5600 with CL46. The CAS latency is much higher in clock terms but only modestly higher in wall-clock terms.


(latency in ns = CL * 2000 / transfer rate in MT/s)

CL22 is also pretty damn bad; most DDR4-3200 kits now have CL16, which makes them 10ns. But even then, your DDR4-3200 CL22 has 13.75ns and your DDR5-5600 CL46 has 16.4ns. And that is assuming your PC will handle 5600, which isn't the JEDEC standard, so most OEMs can say goodbye to it.
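
Spelled out as code, since the formula is easy to misread (ns = CL * 2000 / transfer rate in MT/s, because the I/O clock runs at half the transfer rate):

    /* First-word CAS latency in ns from CL and transfer rate. */
    #include <stdio.h>

    static double cas_ns(int cl, int mts) { return cl * 2000.0 / mts; }

    int main(void) {
        printf("DDR4-3200 CL16: %5.2f ns\n", cas_ns(16, 3200));  /* 10.00 */
        printf("DDR4-3200 CL22: %5.2f ns\n", cas_ns(22, 3200));  /* 13.75 */
        printf("DDR5-4800 CL40: %5.2f ns\n", cas_ns(40, 4800));  /* 16.67 */
        printf("DDR5-5600 CL46: %5.2f ns\n", cas_ns(46, 5600));  /* 16.43 */
        return 0;
    }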


Oh really? It is the supported memory type written on the box of my Intel processor.


Once again, the subject is not you buying your processor individually (enthusiast hardware), but your run-of-the-mill, 80%-of-people-have-it OEM stuff. Your tower PC bought at the local supermarket, your laptop: all of these will have a combination of:

* CPU not supporting such a high rate

* Motherboard not supporting it

* CPU lost the silicon lottery and has a bad memory controller that struggles above basic rates

* vendor put in DDR5-4800 in anyways to save up on costs

* BIOS not up to date on your motherboard


But what you said is, it’s not a standard. And I’m looking at a press release from 2021 that says it is.

> JESD79-5A expands the timing definition and transfer speed of DDR5 up to 6400 MT/s for DRAM core timings and 5600 MT/s for IO AC timings to enable the industry to build an ecosystem up to 5600 MT/s.


Jesus fuck, thanks for reminding me that HN is full of needlessly pedantic assholes.

JEDEC _default_, not standard. Satisfied ? As in, the default value is 4800, and now, with JESD79-5A, JEDEC also recognizes that DDR5-6400 can exist. Not "is the standard".


Hey if you don’t want to have this conversation, stop repeating stuff you read on Reddit but don’t understand.


How do you define "non-enthusiast hardware"? All consumer AMD CPUs should be able to clock up to 6000 MT/s and Raptor Lake should do 6600 MT/s stable at the very least with Hynix M-dies. 5600 MT/s is not even an overclock on Intel CPUs.


>How do you define "non-enthusiast hardware"?

That OEM motherboard in your big box tower.

>All consumer AMD CPUs should be able to clock up to 6000 MT/s and Raptor Lake should do 6600 MT/s stable at the very least with Hynix M-dies. 5600 MT/s is not even an overclock on Intel CPUs.

DDR5 speeds currently crash like a rock if there's more than one DIMM per channel, even moreso if those DIMMs have two ranks rather than one.

I've got four sticks of 16GB DDR5 5600MHz in my Intel i7 12700K, Z690 system. The best I can get the RAM going is 4800MHz, which is still an overclock because Intel's memory controller goes down to 4000MHz rated clock speed with this setup (two DIMMs per channel, 1 rank per DIMM).


> DDR5 speeds currently crash like a rock if there's more than one DIMM per channel, even moreso if those DIMMs have two ranks rather than one.

How would you determine that with dmidecode?


Yes, that currently doesn't work well, but most people are using 1DPC, not 2DPC. You can't really even find a kit with 4 DDR5 sticks.


>You can't really even find a kit with 4 DDR5 sticks.

https://www.corsair.com/ww/en/Categories/Products/Memory/VEN...

4 X 48GB DDR5 5200 40CL latency. G-Skill just came out with some new kits that are even faster.


>You can't really even find a kit with 4 DDR5 sticks.

You can: https://www.newegg.com/corsair-64gb/p/N82E16820236933

Maybe once prices come down more I'll consider going to two sticks of 32GB instead of four sticks of 16GB, who knows.


At least in my country (New Zealand) only Newegg (which ships from US and has very expensive shipping) seems to have these 4 stick DDR5 kits and only from Crucial and they are all more expensive than a 2 stick kit with the same capacity, so I'm not sure why anyone would get them.


Populating all of your RAM slots with matched RAM improves your overall bandwidth. A 2x8GB set will run in dual-channel mode, the 4x4 set will run in quad-channel mode (if the CPU supports it, most times it's dual-channel.) The 4x4 set will allow for more bandwidth vs the 2x8. Not quite double, but it reaches at it. Every other metric doesn't really see much of an improvement, though, so this only helps out in bandwidth-intensive applications.


In my (and probably most peoples') case I'm running in dual channel, so that's two 16GB DIMMs per channel. This makes the memory controller clock down to 4000MHz rated.


There weren't any 32GB sticks back when I put my system together, so it is what it is. :V


Overclocking _is_ enthusiast behavior, and you can pretty much only do it with enthusiast hardware. As another person told you, if you're buying your components yourself, congratulations, it's enthusiast hardware. Many will still get laptops (that do not allow overclocking), or crappy motherboards from OEMs (that do not allow overclocking). The current reality of things is that DDR5-4800 is the default JEDEC speed, and that it's crap. The same way that DDR4-2133 used to be the default JEDEC speed, and now things got better and 3200 is the default without overclock.


I thought we were going into chiplets and tighter integrations.

A bit like what Apple did with M1/M2, putting ram/flash/cpu/gpu on the same package with very short, wide and fast interconnects.

Leaving only IO, specialized chips and power outside of the main package.


DIMMS aren't going anywhere anytime soon even if it's plausible that they're going to get phased out of consumer tier products. There's way too many databases out there that need absurd amounts of memory to run performantly.

Epyc supports 6tb per socket - try getting that much memory on-package any time soon.


Apple marketing worked great on you. Putting RAM on package was to upsell you to a higher, more expensive model and prevent life-cycle-prolonging field upgrades. Anything below 1024 bits is just cost cutting and forces market segmentation. Xbox has no problem fanning out a 320-bit bus in addition to PCIe lanes, same for high-end GPUs. The HBM Apple was alluding to, but carefully not saying out loud, starts at 4096 bits.

M1/M2 ram is just 128bit like any other normal mobile processor. M1/M2 Pro comes with 256bit bus (2 custom LPDDR5 chips) and 200 GB/s, M1/M2 Max with 512bit.

POP ram makes sure nobody will be able to upgrade, hardware locked recurring revenue.


I'm pretty excited about the potential of on-package memory to enable LLM's and other AI stuff to run on local hardware with higher performance than using DDR DIMMs and lower cost than shelling out 80 grand for a pair of H100s.

When the arguments for DIMMs are "field upgrades" which is a problem you can avoid by buying more memory to begin with (Also in practice few users actually upgrade memory), and "price segmentation" which isn't really a technical consideration, I see the writing on the wall. At least in the consumer space.


I agree with you. I think Apple is trailblazing again. Most laptops will go in that direction. Tighter integration will lead to better performance/cost and better efficiency also.

AMD 3D cache is also going in that direction integrating ram/cache on top of cpu with high density of interconnect.


You are excited about swapping out somehow making for better performance? Even the biggest M2 Max can't fit bigger models. I don't know anyone with an older MacBook who didn't max out the RAM as soon as it became financially sensible.


For consumer devices that don't need much RAM, that makes sense, but for servers, I imagine it'll end up as more of an (optionally) tiered set-up, if you look at the Sapphire Rapids chips with HBM on, they can be run with only that, but they can also be run with it as another cache layer, or as a separate faster memory region.

If you need an absolute ton of memory, having it all directly on the die becomes impractical at some point.


Hope not, that means you'd only get mainstream hardware. Incredibly limiting and boring.


> This is made possible by placing a mux between the memory and the CPU that combines the two 64-bit accesses into a single 128-bit data path for the CPU.

This does not seem to be correct. Modules with a 128-bit data bus would obviously require a lot more pins. Instead what this actually seems to be about is placing a buffer between the two ranks (essentially two completely independent modules) and the data bus (presumably it buffers the command bus as well). The "128 bit" data bus on the host side is achieved by running QDR over the 64 bit wide bus.

Basically an evolution of LRDIMMs with a pinch of FBDIMM.


> Instead what this actually seems to be about is placing a buffer between the two ranks (essentially two completely independent modules) and the data bus

AKA a Mux - or Multiplexer


The part that’s inaccurate is calling it a 128-bit data path. It’s not, it’s just a higher speed 64-bit path


No, multiplexers are more or less just dumb analog switches, though more specialized ones like e.g. PCIe muxes can include redrivers.


Multiplexers can be analog or digital

https://en.wikipedia.org/wiki/Multiplexer

> In electronics, a multiplexer (or mux; spelled sometimes as multiplexor), also known as a data selector, is a device that selects between several analog or digital input signals and forwards the selected input to a single output line.

An example of a common digital multiplexer would be the 74HC157 but if you think they're really not multiplexers I'm sure NXP and Texas Instruments would love to hear from you

Multiplexing is a concept anyway, the medium doesn't matter - you can multiplex radio signals, light, sound etc.


I certainly don't know, but I think what the guy is saying is that they are necessarily analog switches here because of the speed they have to operate at.


There is a real difference between an analog and digital multiplexer regardless of speed. For an analog mux, whatever voltage you put in you get out, including mid-rail voltages. For a digital mux, any input < VDD/2 causes an output of 0 V, and any input > VDD/2 causes an output of VDD volts.

You can use an analog mux on digital signals, but it probably won’t be as good as a digital one. However, you can’t use a digital mux to switch analog signals, it won’t work.

Edit: the one exception to this is for bidirectional digital signals you may want to use an analog mux, since you don’t know which side may be driving the lines. This would apply to DRAM since the data lines are bidirectional, so probably using an analog mux.
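
A toy numeric model of that distinction, purely for illustration (VDD and the mid-rail threshold are the only assumptions here):

    /* Toy model: an analog mux passes the selected voltage through unchanged,
       a digital mux regenerates it to the nearest rail. */
    #include <stdio.h>

    #define VDD 1.0

    static double analog_mux(double a, double b, int sel)  { return sel ? b : a; }
    static double digital_mux(double a, double b, int sel) {
        double v = sel ? b : a;
        return v > VDD / 2 ? VDD : 0.0;
    }

    int main(void) {
        double degraded_one = 0.6;   /* a weakened logic "1" arriving on the wire */
        printf("analog : %.2f V in -> %.2f V out (value preserved)\n",
               degraded_one, analog_mux(degraded_one, 0.0, 0));
        printf("digital: %.2f V in -> %.2f V out (snapped to a rail)\n",
               degraded_one, digital_mux(degraded_one, 0.0, 0));
        return 0;
    }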


In some ways that’s true of a lot of modern digital computing. Kind of wild when you think about it…


It is, and it leads me to wonder what the difference between analogue and digital actually is, by definition. Is it latches and a clock? Serious question.


By definition, an analog signal encodes an “infinite” amount of information in whatever value it is taking on at the time point in question, while a digital signal encodes information in discrete states. For binary that is typically two states, but sometimes you can divide things finer, like multilevel flash cells or PAM4 signaling.

With digital, the values "round to nearest", so if the supply is 1 V, then 0.6 V, 0.9 V, and 1 V all mean logical 1. Whereas for analog those different values mean different things.

Also, for digital you want to push the signals to the extremes so they are easier to read, while for analog that would be destroying information, you want to preserve the value.

(In practice there are fundamental limits to how much useful information you get out of analog signals due to noise.)


The answer's appreciated, but I'm afraid I didn't ask a very good question. What I was really getting at was, from the perspective of an EE, what would make an expert call one type of circuitry digital and the other analog?

Clearly the output signal has to be 0 or 1, which must mean the signal is boosted or reduced to an appropriate level (assuming we're not trying to encode multiple values onto a signal; I'm just talking binary). So I guess the signal has to be captured (latched?) then modified. I don't think a clock is absolutely necessary as there have been asynchronous digital CPUs, but the presence of a clock is very typical as well.


Analog circuits are ones that primarily deal with the processing and manipulation of analog signals, while digital circuits are those that primarily deal with digital signals. Refer to my prior reply for the difference between analog and digital signals.

A clock is not necessary in this definition. For example a simple logic AND gate is digital and has no clock. The input is not captured or latched before being modified; nevertheless it is a digital circuit.


If you look at the image, it seems that the "mux" is actually a data buffer (probably much larger than a single transaction's worth), that would mean more latency for a transfer, but less time holding up the bus. At least that is my interpretation.


I can't tell if it's a buffer in the CS sense like you're saying, or a buffer in the EE sense, which is a device which simply boosts weak signals when they have too many taps in exchange for a slight delay.


Hopefully this is reliable memory, not subject to rowhammer or other hardware design failure, with full ECC. We have reached the point where memory just can't be trusted otherwise.


Oh come on! It's a problem all right but we don't need to go overboard.


Hopefully full ECC?


Woah, woah, woah… what’s with the crazy talk?

Do you want hardware vendor RMAs to increase or something?


[flagged]


Said someone who hasn’t lost data due to bit flips and/or bad RAM sticks. The small disadvantages of ECC RAM (slightly more expensive and a bit slower) are nothing compared to the main advantage it offers, the extra safety and peace of mind. I’ve seen people investing a lot of money in high spec PCs and storage systems, while at the same time being cheap on the RAM or selecting based on speed and appearance (RGB).


>slightly more expensive and a ==bit slower==

why slower? b/c there is an extra bit per byte - or more like no one produces 'overclocked' memory with ECC as there is no demand?

>speed and appearance (RGB)

Seriously why you'd build a home rig, have a party (home) and make it dark instead of rainbow alike: zero street cred, and no invitations to LAN parties.


> why slower?

From what I can find[1][2], it seems ECC is not slower for memory with the same frequency and timings. Interestingly, both sources find ECC being consistently a tad faster in some benchmarks.

So presumably it'll be what you suggest, they don't come in overclocked variants. Of course, overclocking means lower margins hence greater chance for sporadic errors.

[1]: https://www.techspot.com/article/845-ddr3-ram-vs-ecc-memory/

[2]: https://www.techpowerup.com/forums/threads/workstation-ddr4-...


Most people would be better off putting that money into a backup solution.


You can't use the deduplication and compression which amplify bitrot if your memory is (too) suspect, and backups don't help you unless you notice the problem quickly enough.

Unreliable memory can corrupt your data in ways you won't notice immediately, by which point the corruption could have spread. Let's use a SQLite database as an example: an index is read into memory, a bit flips in memory, the wrong result is returned by the corrupted index, the corrupted cached index is used in queries, a corrupted query result is used to update the database, and the corrupted cache wasn't marked dirty so it doesn't get written back to the database file. Repeatedly backing up the ticking logic bomb created this way doesn't help you.


Subtle data corruption due to bit-flips isn't necessarily solved by backups. You might not notice until the damage is already done, and having old good data somewhere would be no help at all.


This happened to me. Lost a bunch of data.

I had backups on an external drive that I'd periodically copy data to.

I can't remember the exact sizes but this will still explain in principle what happened.

I had a 1TB drive in my desktop. I had a 500GB external drive. At time of purchase I had less than 500GB to back up.

At some point in time my desktop hard drive started corrupting data unbeknownst to me.

The amount I needed to backup grew beyond 500GB so I purchased a new larger backup drive. I did a full copy (corruption and all) from my desktop to new backup drive.

At some point I repurposed the old backup drive for something else erasing it. It is at this point I have irrecoverable data loss and I still don't know.

The corruption became so widespread on my desktop drive I became aware of it. I check my backup and discover a non trivial amount of my data was corrupted.


I had a similar thing happen to me. The SATA controller probably failed.

At first it corrupted a few files. I thought nothing of it since I had a few power outages. Then more files. So I reformatted, but file corruption kept happening. Switched the drive to a separate chipset with the same cable and all was good.

My current solution to this situation is a low-power PC which runs FreeBSD that has ECC RAM and a ZFS pool consisting of five mirrored drives. This PC gets backups pushed to it from my main workstation and makes a snapshot each time. I plan to change it though to a pull configuration. This way it will be immune to cryptolocker software performing privilege escalation attacks since no services will be offered and no credentials will be viewed by the workstation. I have to configure it using its own keyboard though.

Even then the backups need to be tested.


> Even then the backups need to be tested.

Isn't that the role of zfs scrub?

Or do you mean testing if say a JPG file is still a valid JPG?

I think there are scripts that can store a md5 of each file in a sqlite database for filesystems without checksumming such as xfs


I meant tested before restoring. If the same problem as mentioned above were to occur I would have backups of all my files pre corruption though they would be spread across multiple snapshots.

Also, from my understanding TCP/IP error correction isn't that great: https://news.ycombinator.com/item?id=25335936

It's definitely possible to write a script that compares a file across multiple snapshots and flags it if its content changes but its modification time does not. It will just get tricky when the file gets modified between backups, as the file could have been modified, then corrupted, then backed up. In that case how does the script know that the file has been corrupted?


So your local disk/RAM corrupts data and it gets pushed to the ECC box...

It's all well if you notice it soon enough, but rarely touched files can drop off retention and you're left with a corrupted copy


True. The snapshots are not rolling though and I don't have much data but you are right. It's not going to be fun picking through my snapshots for individual files if they get corrupted over time.

This seems like an unavoidable issue though when using a workstation without ECC RAM and a copy-on-write filesystem. I thought about moving the files off my workstation to my NAS which stores my media files. This does tick both the CoW and ECC boxes but it's not properly set up yet. Setting up an iSCSI target on the NAS is an option but then it gets fiddly when trying to recover specific files from different points in time since I can't just browse the snapshot like any other filesystem.


Getting ECC memory into a workstation should be just "okay, you want it, pay 20% more and you get it", not having to find which combination of CPU, firmware and motherboard is needed for it. It's a sad state we're in.


Tell me about it! When I looked at the price of second hand ECC UDIMMs for my server I almost cried.


I'm a firm believer that local persistent storage should be ripping fast and should fail fast and hard, with data loss. You need backups, it's non-negotiable.

That said, one of the prerequisites for this is that the stuff you're writing to the disk in the first place isn't corrupt.


Well, many SSDs have already implemented "just fucking die without the option to read any data"...


Yep, my favourite kind of SSD. I don't care if you trash my photos, they're backed up. I don't care if you trash my code, it's all in git.


Backing up corrupted data isn't helpful.


It's hard for that data to bypass RAM, so you'd just be backing up corrupt data


WRONG! ECC has caught what would've been random crashes and data corruption twice on my personal workstation that I noticed (detected and corrected). I've seen 100s of corrected ECC errors over the years on my servers and even a few uncorrectable ECC errors caught (in one case caused by a bad DIMM and in another case caused by not fully locking the clips on both sides of a DIMM). You're missing the need for ECC because you're blind to the memory corruption it would prevent, until the corruption manifests in large-scale symptoms, e.g. corrupted compressed files and random unreproducible crashes.


Where do you find that logged?


/sys/devices/system/edac/ in Linux. Might need to load driver for cpu tho.
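
For example, something like this dumps the counters if the right EDAC driver is loaded (the mc0 paths are the usual sysfs layout; adjust for however many memory controllers your machine exposes):

    /* Print EDAC corrected/uncorrected error counts from sysfs. */
    #include <stdio.h>

    static void show(const char *path) {
        long n;
        FILE *f = fopen(path, "r");
        if (f && fscanf(f, "%ld", &n) == 1)
            printf("%-48s %ld\n", path, n);
        else
            printf("%-48s (not available - EDAC driver loaded?)\n", path);
        if (f) fclose(f);
    }

    int main(void) {
        show("/sys/devices/system/edac/mc/mc0/ce_count");  /* corrected   */
        show("/sys/devices/system/edac/mc/mc0/ue_count");  /* uncorrected */
        return 0;
    }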


ECC should be everywhere by default without any exception. Imagine if your car had an upgrade available that made the engine "reliable"?


Considering the immense amount of consumer hardware out there, the amount of issues caused by bitflips is not that minuscule. I think especially consumer hardware needs to be more reliable considering how few other risks are mitigated (with, for example, backups).


Seriously? What Intel did - segmenting the market and forcing people to pay for a feature that's already supported in the memory controller - was just greed.


It wasn’t just Intel. Hardware vendors loved it.

Imagine all the RMAs they avoided since non-ECC RAM just lets the computer keep on keeping on making any RAM issue look like a software issue. Especially in current times with the popularity of soldered in RAM.

Hats off to AMD for supporting ECC RAM though.


ECC allows the CPU to auto-correct errors for the most part. Then replace the memory when the errors become too frequent, or too 'bad' - uncorrectable errors. The hardware vendors should be even happier, more chips per DRAM module to boot.


Hardware vendors are very happy when a user's problem lacks the introspection to definitively point to hardware as the root cause, making it not their support team's problem. They also like reducing BoMs.


You fell for the Intel propaganda


Naive question: would this also mean you can also get more memory for cheaper at the cost of space? i.e. instead of using expensive high density 2GB chips you could use 2x1GB or something?


The multiplexer chip probably cancels out any cost savings.


With the continued evolution of chiplets, I’m annually shocked we don’t have DDR5 built directly into the processors same wafer.

The yield is maybe a factor, but that is solved with experience. Why is no company even trying?

Is it an IP issue? I seriously don’t understand (especially for phones) why it is still optimal to keep RAM separate.

Could someone help me understand this?

and I’d really appreciate leaving “upgradability” out of this conversation. That is a separate topic that can be had but not interesting to me right now.


CMOS and DRAM are completely different processes and cannot be integrated on the same wafer. This has been tried many times unsuccessfully. This is why you see highly integrated packaging solutions like chiplets. These solutions integrate CMOS and DRAM in the same package, not on the same die.


Thanks, this is what I'd love to understand better.

At the highest level isn't it all EUV? Why can't the laser print both patterns onto the wafer?


A CMOS chip doesn't contain any capacitors. DRAM is made of capacitors, and thus has a radically different structure.


Chiplets are also used for that, sure (stuff like radios is occasionally a separate chiplet made in a different process), but I think the main push is smaller chips = higher yields


The best silicon process for DRAM is significantly different than for CPU logic. So it's not practical to combine them on one die -- you'd have to compromise the speed or density of one or the other.

But the chiplet approach works, and nVidia and Apple are both shipping it. I dunno what's going on at Intel. Hopefully they'll pull themselves together.


Their new Xeons have HBM in the package. It's even possible to build a workable workstation without any DRAM slots with those (provided you can get rid of 300W of heat but, for a workstation, using all cores at full speed should be a relatively rare occurrence).


> for a workstation, using all cores at full speed should be a relatively rare occurrence

Huh? Sustained all-core load is why I have a workstation in the first place, it's for anything that would be annoying to run on my laptop. I'm sure different people have different workloads, but I don't think I'm an outlier here.


> Huh? Sustained all-core load is why I have a workstation in the first place

At least mine spends most of the time waiting for my keystrokes. It does do background tasks, of course, and some heavy lifting sometimes (when I have to wait for them instead of the other way around) but it doesn't spend the day doing the heavy lifting because that'd make the machine much less responsive (and much noisier).

OTOH, if I had a 56-core 112-thread monster, it'd probably be perfectly fine to do the heavy lifting AND the interactive work at the same time. Worst case, I would need to set core affinity so that the heavy lifting doesn't step over the cores running my GUI.


Apple is not shipping any chiplet designs. It's ordinary PoP RAM and a ton of marketing.


> I’m annually shocked we don’t have DDR5 built directly into the processors same wafer.

Sapphire Rapids has models with 64GB of "High Bandwidth Memory".


Isn’t this what Apple is doing with their M-chips?


Apple is using HBM2 ([1]) memory, pioneered by AMD for their GPUs. Essentially, yes - it's that, placing memory really close to compute and making a lot of channels for higher bandwidth.

1. https://en.wikipedia.org/wiki/High_Bandwidth_Memory


Apple marketing made it sound like they were doing something special with RAM on their M socs. Really they mainly improved packaging form factor with run of the mill LPDDR5. It's questionable if this really does much for performance as the M socs still have worse memory latency than Intel and AMD chips.


Thank you. I indeed fell for Apple marketing.


They use commodity LPDDR4X or LPDDR5 depending on SKU, for M1 and M2.


Thanks! I missed that.


uh no they’re not lmao. unless you’re talking about very particular intel machines no one cares about anymore.


HBM is RAM chip stacked on CPU chip, not an integral CPU+RAM on a single die.


No, in classic HBM multiple RAM dies are stacked on top of each other, but they sit next to the CPU, not on top. Edit: well typically it would be a GPU not a CPU.


> and I’d really appreciate leaving “upgradability” out of this conversation

It's not upgradability any more. That means you buy the machine, run out of ram for your working set 2 years later, add more ram.

Now Apple refuses to sell you machines with enough ram from the start.

Would you like your preferred server vendor to do the same?


Some reasons that I can think of:

- Heat generation (harder to cool)

- Trade offs don't justify the pain


I just want Cuda or an equivalent for my AMD Gpu...


ROCm/hip/hipify, the cuda stack and the alternative implementations are a clusterf** though. Can't wait for oneapi or vulkan backends in popular ML frameworks to take over..


This was exactly my sentiment when reading opening the thread. After purchasing a 7900XTX I feel left out of a lot of fun stuff, like stable-diffusion-webui (SHARK does not provide as much features) or running LLMs locally on Windows. After a bit of search, I found https://github.com/RadeonOpenCompute/ROCm/issues/666, so chances are the situation might improve.


Will be interesting to see how this stacks up next to DDR6, which is supposed to come out in 2025 or so, and is also supposed to double bandwidth relative to standard DDR5. My guess is that DDR6 will be more common, with these MRDIMMs reserved for server/HPC builds.


'Interestingly, the slide also says that the need for DDR6 memory is "unclear" due to uncertainty about its value proposition. It goes on to say "Buffered only to deliver value?" possibly implying that MRDIMM technology could be the order of the day as system RAM moves forward.'


Good to see the boost in bandwidth! With NVMe hitting 9GB/s, it did make you think whether HBM was our only path to go faster!


NVME's peak advertised speed is only achievable when you're hitting the NVME's DRAM cache.

The PCIe link isn't going to be competitive with the DDR5 link, since both are measuring DRAM access speeds in this case.


Oh cool. No reason these muxing ideas can't be pushed to embeddeds.


What are the extra pins on the side of the ram stick for?


I can't find what this means for latency. Double ?


Latency should be the same if they interleave properly.


It's really not cool how RAM these days is advertised and sold by its potentially possible XMP overclock speed instead of the JEDEC speed it's actually rated for.


Selling based on JEDEC ratings only would be unhelpful. Not sure about ddr5, but ddr4 JEDEC ratings only go to 3200; there's good reasons to want to run Ryzen chips at 3600 or higher. I'd rather buy something where the packaging tells me it should work at the speed I want to run it, than have to figure out which skus people say tends to work, when all the boxes just say 3200.

Yeah, if you start getting higher speeds, it becomes more a question of if the memory controller on the cpu, the wiring on the motherboard, and the ram itself will work together at that speed, which you can't guess from just the number on the ram packaging.

It would be nice if JEDEC offered more ratings, but given that they don't, I'll take the XMP numbers.


I'm not sure why it has to be either-or.

Give us both the rated JEDEC clocks and timings and the manufacturer overclock clocks and timings.

To use the DDR4-3200 example, it is far more difficult than it should be to figure out whether that's JEDEC 3200MHz or actually JEDEC 2133MHz with 3200MHz XMP overclocking.


XMP is not like an old-school overclock. Assuming the motherboard isn't faulty, you're pretty much guaranteed to get the clock speeds advertised with no instability. It's plug and play.


It is supposed to be plug and play, but a good fraction of the time it is not. I've been assembling computers from parts since the mid-90s and I never had any trouble with RAM till my latest builds with DDR4 trying to use default XMP settings. In fact I've never gotten a DDR4 computer to run at the advertised XMP speed and timings even when the mobo fully supports it. I always have to back off a few hundred MHz and loosen the timings to prevent memtest86+ errors. I've observed others on IRC computer hardware channels reporting the same problems (selection bias, of course, only people with problems ask questions).

You've been lucky (or you're using latest and greatest mobos).


I've had good luck with DDR4 and reasonable speeds working with XMP, but sometimes the XMP data has a voltage lower than what the packaging indicates, and setting the voltage according to the packaging works a lot better.

But I'm just getting DDR4-3600, not DDR4-5000 or something; and I think all my DDR4 setups are 1 dimm per channel, which helps. I didn't ever get DDR3 XMP to be happy though.


Maybe for home use but I really doubt it's sold that way for servers.



