I think there will soon come a day when DIMMs become wider rather than faster - high clock frequencies are attractive for marketing purposes but, as this article points out, are increasingly harder to design for. DIMMs have been 64 bits wide since they appeared, to match the P5's bus width, but modern processors have much wider internal data paths, so it'd make more sense to keep the same clocks and widen each module.
Starting in the 90s there was a trend of replacing parallel buses with serial ones (USB, SATA, etc.) but I think it's slowly beginning to reverse as people realise the difficulties of using high frequencies (e.g. multiple lanes in PCI-E).
Incidentally the term DIMM - dual inline memory module - comes from the doubled width over the existing SIMM, which was 32 bits wide. Maybe we'll see 128-bit-wide QIMMs in the future...
Because of the latency difference between RAM and on-chip caches, you don't gain much by making the data bus to RAM wider; latency is still the most limiting factor. The modern solution is to have multiple full memory interfaces ("channels"), which let you both behave as if the memory bus were wider and issue completely independent memory transactions when the accesses are suitably aligned (which is the more significant performance boost).
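As a rough sketch of what that channel interleaving looks like (the line size, channel count, and mapping below are made up purely for illustration; real controllers use more elaborate, often XOR-based, hash functions):

    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical dual-channel address decode: interleave on 64-byte
       cache lines so consecutive lines land on alternating channels and
       two independent accesses can proceed in parallel. */
    #define LINE_SHIFT   6   /* 64-byte cache line */
    #define NUM_CHANNELS 2

    static unsigned channel_of(uint64_t phys_addr) {
        return (phys_addr >> LINE_SHIFT) & (NUM_CHANNELS - 1);
    }

    static uint64_t offset_within_channel(uint64_t phys_addr) {
        uint64_t line   = phys_addr >> LINE_SHIFT;
        uint64_t within = phys_addr & ((1u << LINE_SHIFT) - 1);
        return ((line / NUM_CHANNELS) << LINE_SHIFT) | within;
    }

    int main(void) {
        for (uint64_t a = 0; a < 4 * 64; a += 64)
            printf("addr 0x%03llx -> channel %u, offset 0x%03llx\n",
                   (unsigned long long)a, channel_of(a),
                   (unsigned long long)offset_within_channel(a));
        return 0;
    }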
Also, the general trend toward weird (e.g. PCI) and serial interfaces in the 90s was motivated by per-unit costs at the expense of required engineering. The primary motivator for PCI's reflected-wave switching and multiplexed address and data was to limit the number of pins and the number of required passive components on the motherboard. The serial interfaces that came after that (SATA, PCI-E) were motivated by the fact that routing fast parallel synchronous buses over any significant distance is a hard problem because of propagation times, which have to be roughly equal for all bus wires; that implies the PCB material parameters have to be known and reasonably consistent (and that means significant additional per-unit expense in PCB manufacture and testing).
Currently the only interface that requires careful routing and design on a PC motherboard is the memory interface, which can be made short enough that manufacturing differences are negligible. (It's funny how TI has a 20-page application note on correct routing of USB2 with rules like "no vias preferably, at most one", "no stubs", "controlled impedance" and "as short as possible", while Intel's layout recommendations for USB2 can be summarized as "it's a differential pair, discontinuities do not matter much, use sensible routing".)
Also, routing wide fast interfaces is hard, because of clock skew between the parallel lanes. Having independent parallel channels mitigates that somewhat, but if anything the trend is towards narrow serial interconnects.
That is more of an answer to the routing and signal-integrity problems of fast parallel interfaces, i.e. bandwidth. Most of the DRAM latency is inherent in the DRAM array itself (actually getting the data between the DRAM array and the sense amplifiers). Almost nobody cares about the additional cycle of latency introduced by registered DIMMs, as it is essentially noise compared to the precharge latency of the DRAM itself. That latency comes mostly from physical limits on what can be manufactured with reasonable power dissipation and reliability (see how CAS latency, in cycles, grows at a comparable rate to the clock rate, i.e. stays roughly constant in wall-clock time).
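To make that last point concrete, here's a back-of-the-envelope calculation with typical retail parts (the CL values are just representative examples):

    #include <stdio.h>

    /* CAS latency in wall-clock time: CL cycles divided by the command
       clock, which for DDR parts is half the transfer rate. */
    struct part { const char *name; double mt_per_s; int cl; };

    int main(void) {
        struct part parts[] = {
            { "DDR2-800  CL5",   800, 5 },
            { "DDR3-1333 CL9",  1333, 9 },
            { "DDR3-1600 CL11", 1600, 11 },
        };
        for (int i = 0; i < 3; i++) {
            double clock_mhz = parts[i].mt_per_s / 2.0;   /* MT/s -> MHz */
            double ns = parts[i].cl / clock_mhz * 1000.0;
            printf("%-15s ~%.1f ns\n", parts[i].name, ns);
        }
        return 0;
    }

All three come out to roughly 12-14 ns, even though the transfer rate doubled.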
Take a look at Micron's "HMC" stacked memory architecture. THIS is a move in the right direction. The short-term caveat is its focus on higher-performance niches like servers. It will be a long time before you see this type of architecture in consumer PCs, given DDR3's dirt-cheap pricing.
> The modern solution is to have multiple full memory interfaces ("channels"), which let you both behave as if the memory bus were wider and issue completely independent memory transactions when the accesses are suitably aligned (which is the more significant performance boost).
In other words, RAID-0, for memory. :-)
I wonder if it would be interesting to have memory controllers managing striping/mirroring of memory modules and create read-optimized and write-optimized memory regions. 2 mirrored modules would give you half the latency for reads and the same latency as a single one for writes.
Has anyone already done this?
edit: and now I'm imagining an inter-memory-module bus to manage bank transitions to/from mirrored/striped without loading the processor bus.
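A toy model of the idea (purely illustrative; no real controller exposes anything like this to software, and as pointed out below, per-access latency is still set by the DRAM itself):

    #include <stdint.h>
    #include <stdio.h>

    /* Toy mirrored pair: reads alternate between the two copies so that
       two independent reads could overlap, while every write must update
       both copies. */
    #define MODULE_SIZE 4096

    struct mirrored_pair {
        uint8_t  module[2][MODULE_SIZE];
        unsigned next;                    /* round-robin read selector */
    };

    static uint8_t mirrored_read(struct mirrored_pair *p, uint32_t addr) {
        unsigned m = p->next;
        p->next ^= 1;                     /* alternate copies between reads */
        return p->module[m][addr % MODULE_SIZE];
    }

    static void mirrored_write(struct mirrored_pair *p, uint32_t addr, uint8_t v) {
        p->module[0][addr % MODULE_SIZE] = v;   /* both copies must be written */
        p->module[1][addr % MODULE_SIZE] = v;
    }

    int main(void) {
        static struct mirrored_pair pair;
        mirrored_write(&pair, 42, 0xAB);
        printf("read 1: 0x%02x, read 2: 0x%02x\n",
               mirrored_read(&pair, 42), mirrored_read(&pair, 42));
        return 0;
    }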
>I wonder if it would be interesting to have memory controllers managing striping/mirroring of memory modules
This is already common. Log onto Dell.com and configure a high-end server, and you will see an extensive number of options for mirroring and advanced ECC configurations.
It's like two individual drives, one for /etc and the other for /usr.
And no, you can't have half the latency; latency is dictated by the physical speed of the actual RAM arrays inside the chips, which are clocked at 200MHz for a typical DDR3-1600 module, up to around 300MHz for the fastest DDR3 ones.
Actually it's more like RAID-0 than separate partitions, because anything upstream of the memory controller does not have to care about this in any way. The amount of complexity between a modern CPU and memory is quite fascinating (and a huge amount of essentially invisible and relatively complex machinery has been there at least since the i486, i.e. in anything that does not care, in the works/does-not-work sense, about how you combine memory modules).
The term DIMM has to do with the actual mechanical connector, not with data bus width: SIMM modules have the pads on opposite sides of the connector connected together, DIMM ones do not.
There is even a JEDEC specification for DIMMs with a 32-bit-wide data bus (such a DIMM was used in Corel's NetWinder, for instance).
And as for parallel vs. serial, essentially the whole second page of the article is about an issue that is mostly specific to synchronous parallel interfaces. Multi-lane PCI-E neatly sidesteps such issues by using what essentially amounts to multiple independent serial links bonded together. But a serial interface to a RAM chip greatly increases the complexity of the RAM chip (the SDRAM protocol looks at first glance like something very hard to implement in a RAM chip, but in comparison with interfaces like PCI-E it's a trivial state machine).
One day DIMMs will become obsolete. Intel is already trying to move to stacked memory connected via a very wide in-package bus. Knights Landing will have 16GB of the stuff.
Or rather, wire up the address pins in any order, and the data pins in any order. A simple RAM chip's inputs were just symmetric: as long as you didn't care which transistor was storing which bit, you could wire it up however you liked.
For a ROM chip you actually had to be careful that it generated the correct output for a given address, unless you were willing to burn an EPROM with the reverse transformation applied.
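A sketch of what applying the reverse transformation means, with a made-up example where the board swaps ROM address lines A0 and A3 (for RAM none of this is needed, since data goes through the same permutation on both write and read):

    #include <stdint.h>
    #include <stdio.h>

    /* If the board swaps address lines A0 and A3, the chip sees swap(A)
       whenever the CPU drives address A. For the chip to return the right
       byte, the image has to be burned with the inverse permutation
       applied; a pin swap is its own inverse. */
    static uint32_t swap_a0_a3(uint32_t addr) {
        uint32_t b0 = (addr >> 0) & 1, b3 = (addr >> 3) & 1;
        addr &= ~((1u << 0) | (1u << 3));
        return addr | (b0 << 3) | (b3 << 0);
    }

    static void permute_image(const uint8_t *in, uint8_t *out, uint32_t size) {
        for (uint32_t a = 0; a < size; a++)
            out[swap_a0_a3(a)] = in[a];   /* chip[swap(a)] is what address a reads */
    }

    int main(void) {
        uint8_t in[16], out[16];
        for (int i = 0; i < 16; i++) in[i] = (uint8_t)i;
        permute_image(in, out, 16);
        for (int i = 0; i < 16; i++)
            printf("chip[%2d] = 0x%02x\n", i, out[i]);
        return 0;
    }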
Even with DDR3 (and probably DDR4) you can still swap bits within a byte lane, and bytes within a word/channel depending on the setup of the memory controller. Some things change, and some stay the same!
Also, most current DDR3 controllers "scramble" the data they put on the wires based on the address. The idea is that constant repeating data patterns get spread out across the wires, which reduces emitted EMI.
When you do a suspend-to-ram on a current x86 machine, it has to save the "scrambler seed" to CMOS so that it can decode the data it left stored in RAM.
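A rough sketch of the shape of the thing (the real scrambling functions are vendor-specific and undocumented; the mixer below is arbitrary): the data is XORed with a pseudo-random mask derived from the address and the boot-time seed, and applying the same function on a read undoes it.

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative address-based scrambler: constant data written by
       software becomes a varying pattern on the wires, and the seed is
       needed again after resume to descramble what was left in RAM. */
    static uint64_t mask_for(uint64_t seed, uint64_t addr) {
        uint64_t x = seed ^ (addr * 0x9E3779B97F4A7C15ull);   /* arbitrary mixer */
        x ^= x >> 33;  x *= 0xFF51AFD7ED558CCDull;  x ^= x >> 33;
        return x;
    }

    static uint64_t scramble(uint64_t seed, uint64_t addr, uint64_t data) {
        return data ^ mask_for(seed, addr);   /* the same call descrambles */
    }

    int main(void) {
        uint64_t seed = 0x1234;
        for (uint64_t addr = 0; addr < 4 * 8; addr += 8)
            printf("addr 0x%02llx -> wire pattern 0x%016llx\n",
                   (unsigned long long)addr,
                   (unsigned long long)scramble(seed, addr, 0));  /* all-zero data */
        return 0;
    }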
I think this also helps mitigate the attack where the DRAM is chilled to make its contents less volatile, the machine is powered off, and the DRAM is then dumped in search of sensitive material.
That actually surprises me. I would have guessed that there would be some amount of metadata you could query over the data lines, such as the specs of the DIMM. I guess that's all sent over I2C or something instead?
Yeah, that's all done through a separate EEPROM chip accessed over SMBus, which is essentially the same as I2C. Raw SDRAM chips don't provide any way to query their specification; on embedded systems that use them, the settings are generally all hardcoded in the bootloader.
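If you're curious, on Linux you can poke at that EEPROM yourself through i2c-dev (the bus number and the conventional 0x50 slave address vary by board, and decode-dimms from i2c-tools does the full decode; this is just a sketch with minimal error handling):

    #include <fcntl.h>
    #include <linux/i2c-dev.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("/dev/i2c-0", O_RDWR);      /* SMBus exposed as an i2c adapter */
        if (fd < 0) { perror("open"); return 1; }
        if (ioctl(fd, I2C_SLAVE, 0x50) < 0) {     /* SPD EEPROM of the first DIMM slot */
            perror("ioctl"); return 1;
        }

        uint8_t offset = 0;                       /* start of SPD data */
        uint8_t spd[16];
        if (write(fd, &offset, 1) != 1 ||
            read(fd, spd, sizeof spd) != (ssize_t)sizeof spd) {
            perror("spd read"); return 1;
        }
        /* SPD byte 2 is the DRAM device type, e.g. 0x0B for DDR3. */
        printf("memory type byte: 0x%02x\n", spd[2]);
        return 0;
    }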
SDRAM's interface doesn't leave room for extra commands; it's pretty much designed to read/store data and refresh the contents. It only has three lines to select the command (RAS#, CAS#, WE#).
Other RAM/flash parts with different interfaces (e.g. SPI) require a command prefix to be sent over the wire, so it's easier for them to implement things like chip-ID commands.
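For reference, the whole command set is just the combination of those pins sampled on a clock edge (classic SDR/DDR SDRAM encoding, chip select active low):

    CS#  RAS#  CAS#  WE#   Command
     L    L     H     H    ACTIVATE (open a row)
     L    H     L     H    READ
     L    H     L     L    WRITE
     L    L     H     L    PRECHARGE
     L    L     L     H    (AUTO) REFRESH
     L    L     L     L    LOAD MODE REGISTER
     L    H     H     H    NOP
     H    x     x     x    DESELECT

There's simply nowhere to put an "identify yourself" command in that scheme.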