IBM doubles its 14nm EDRAM density, adds hundreds of megabytes of cache (wikichip.org)
202 points by insulanian on March 8, 2020 | 77 comments



I wonder if it will ever see a chance to go mainstream.

The memory bottleneck is pretty much the only thing in CPU design that hasn't seen a dramatic improvement over the years. Eliminating it is the only obvious improvement path still left that promises double-digit performance gains.

So we need either very big and very fast caches, or extremely wide and low latency memory. Both options are quite costly.

Adding on-die DRAM that can work at least as fast as 500 MHz will surely require some specialty process with a lot of compromises, like the one in the article.

Gluing something like HBM2 to the die, the second option, moves the cost from the specialty process to specialty packaging. Not much better.


The reason why mainstream DRAM interfaces are narrow and “slow” is that you need row-at-a-time access patterns to really saturate the interconnect, which is something that does not happen for general-purpose workloads. Producing such access patterns requires large caches, which by themselves solve the issue. And then physical package pins and pad structures are also among the most expensive things in semiconductor design.

In the end, the DRAM array is a bunch of analog magic, and the interface works by copying the row you want into an SRAM buffer on the chip, which you can then access however you want. The slowest operations in all of that are the copies between the SRAM row buffer and the actual DRAM array. (What I call the SRAM buffer is usually called the “column sense amplifiers”, but for the high-level view it is in fact a surprisingly wide array of 6T SRAM flip-flops and some analog magic.)
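To make the access-pattern point concrete, here is a minimal sketch of a hypothetical microbenchmark (numbers vary wildly by platform): both loops touch exactly the same elements, but the strided one keeps forcing the DRAM to close one row and open another, while the streaming one mostly reads from rows already sitting in the sense amplifiers (and gets help from the hardware prefetchers).

    /* Hypothetical sketch: streaming vs. row-hopping access over a buffer
     * much larger than any cache. Same total work, very different cost. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>
    #include <time.h>

    #define N ((size_t)32 * 1024 * 1024)   /* 32M uint64_t = 256 MiB */
    #define STRIDE 8192                    /* ~64 KiB jumps: nearly every access opens a new row */

    static double now_sec(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        uint64_t *a = malloc(N * sizeof *a);
        if (!a) return 1;
        for (size_t i = 0; i < N; i++) a[i] = i;      /* fault in every page */

        volatile uint64_t sum = 0;
        double t0 = now_sec();
        for (size_t i = 0; i < N; i++) sum += a[i];   /* streaming: row-buffer friendly */
        double t1 = now_sec();
        for (size_t s = 0; s < STRIDE; s++)           /* strided: row-buffer hostile */
            for (size_t i = s; i < N; i += STRIDE) sum += a[i];
        double t2 = now_sec();

        printf("sequential: %.2fs  strided: %.2fs\n", t1 - t0, t2 - t1);
        free(a);
        return 0;
    }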


So how many levels of cache do we have now between the ALU and the memory cell of record?


Physical register <-> L1 <-> Fill Buffer <-> L2 <-> L3 [<-> L4 ] <-> [bus buffer <->] memory module

I wouldn't count buffer registers inside DRAM chips as a separate cache.


Counting the DRAM row buffer as a separate cache layer is interesting because it has a meaningful performance impact, but mainly because the existence of such a thing is a good counterargument to people who think that DRAM chips multiplex RAS/CAS address pins only to save package pins.


I wouldn't really count the fill buffer between L1 and L2 as a separate caching level, or at least if you wanted to do that you should do something similar for L2 <-> L3, L3 <-> memory controller, etc., since such buffers or queues exist at all of those places.


L1 fill buffers are somewhat special, at least on Intel, because a memory read can be satisfied from them before the data is written into L1.


↔ SSD ↔ spinning rust ↔ tarsnap or something


I beg to differ: the thing that made the original AMD64 chips so freakishly awesome was having multiple memory controllers in a multi-socket system. That forced Intel to do something kinda similar, but the Opteron line and then the Ryzen and EPYC lines have always been better than Intel in terms of raw memory bandwidth[1].

And when there was a clear need for the bandwidth, as in GPUs, circuit designers stepped up and created some amazing wide and deep memory bus architectures.

I would not be surprised to see AMD partner with IBM to utilize some of their EDRAM tech to keep themselves out ahead of Intel in the data center space. The more ways they can distinguish themselves, the more pressure they put on Intel's design teams.

[1] Yes, lookaside buffers and page tables in the Opterons took some of that advantage back, but it was still substantial.


> Gluing something like HBM2 to the die for a second

This is something AMD is already doing for EPYC, and they've already used HBM2 in their GPUs.

So I'm surprised they haven't released any CPU models with crazy huge L4 caches using a few GB of HBM2.

Then again, Intel made a laptop CPU with a huge 128MB cache and their comment was that it didn't make that big of a difference. I believe the performance boost was less than 5% for going from 64MB to 128MB.


It's all about how fast the memory is.

960 MiB is prodigiously large for such a microscopic chip, but if it "only" gains a 3-5x latency reduction over external DRAM, it's still very far from a proper L3 implementation, and far behind L2.

Make DRAM work at 1 GHz+, and then you will see miracles. Imagine a fully synchronous on-die DRAM that can sit just behind L1, or even be connected to load registers directly.

The problem is that effective frequencies for a memory round trip haven't gone up much since the nineties. If you run at 100% cache misses, your memory will still be working at an effective frequency of around 100 to 200 MHz.


> If you run at 100% cache misses, your memory will still be working at an effective frequency of around 100 to 200 MHz

I think you mean 20MHz. Current DRAM is abysmal at random access.


Yes, it's more like that if you set aside how it looks from the electrical side.

Even after the bytes arrive at the DRAM controller's analog side, a lot has to happen before the data gets to a register. That accounts for a further five- to tenfold increase in round-trip latency.
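Rough back-of-the-envelope for the numbers in this subthread (assuming typical DDR4 timings, not anything from the article): a full row miss costs roughly tRP + tRCD + CL ≈ 14 + 14 + 14 ns, call it ~45-50 ns per access, which works out to about 1 / 50 ns ≈ 20 MHz for purely random single-word reads. A row hit costs only the ~14 ns CAS latency plus the burst, and with several banks in flight the effective rate climbs into the 100-200 MHz range mentioned above, before any of the controller and fabric overhead described here is added on top.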


Oh no, it's paging to ram.


Maybe there’s an inflection point where imperative management of the cache is more effective than heuristic management.


I saw Alan Kay speak once, I think at lambda jam 2013. He made a comment that he’s still waiting for mainstream CPUs to allow us to fit an entire interpreter into on-core cache.


Eh? The LuaJIT and K interpreters definitely fit into L1 instruction cache. (And have since well before 2013 probably.) Honestly I would be kind of surprised if most interpreters for reasonably small languages don't fit into icache. Sure Python's not going to any time soon, but most interpreters are really not very large.


You could get Windows NT 4 running in a processor cache these days.


Forth interpreters probably fit into L1 on modern processors.


And we've had 256KB L2 since the Pentium Pro. That's big enough for a Lua interpreter.


What do you mean by imperative? Static control by the program?


I'm guessing u/hinkley meant having an API for the cache, so a compiler could generate instructions to prefetch cache lines (and maybe even keep them hot) rather than depend on automatic (heuristic) cache management.


Just FYI, prefetch instructions are already a thing (at least on x86). At present they're only hints to the underlying heuristic system though. (https://www.felixcloutier.com/x86/prefetchh)

Cache-as-RAM is also a thing, and allows the sort of pinning you described. Pretty specialized use cases though. (https://stackoverflow.com/questions/41775371/what-use-is-the...)
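For anyone curious what the hint looks like in code, here is a minimal sketch (assumed x86 with SSE; the 16-element lookahead is just an illustrative guess):

    /* Sketch: software prefetch hint ahead of a sequential walk.
     * The CPU may drop the hint; replacement/eviction stays heuristic. */
    #include <stddef.h>
    #include <xmmintrin.h>   /* _mm_prefetch; GCC/Clang also offer __builtin_prefetch */

    long sum_with_prefetch(const long *a, size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; i++) {
            /* Ask for the line ~16 elements ahead, into all cache levels (T0 hint). */
            if (i + 16 < n)
                _mm_prefetch((const char *)&a[i + 16], _MM_HINT_T0);
            sum += a[i];
        }
        return sum;
    }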


Read access patterns matter more than cache sizes; triple-digit improvements are possible if you have linear reads.


Late Xeon Phis had 16GB of HBM around the die. It could either complement or mirror main memory.


Hasn't Intel bought the company that was on the brink of producing those CPU-memory combos? Unfortunately, I haven't been able to find the name of the company or an article about it.


You mean UpMem or Venray Technology?


Sorry but I don't remember the name at all.


With all of the speculation security bugs in chip cache management, I can’t help but wonder if we won’t eventually go full NUMA and turn the cache memory (or at least L2+) into a directly addressable space, either by the kernel or directly by application code. At which point your working set is explicitly on the processor, instead of implicitly.

I also wonder if chiplets will be the vehicle by which this comes to pass.


Cache can already be made directly addressable (cache-as-ram mode). But outside of some niche solutions that's mostly done during boot due to the limitations of that mode and cache being too valuable.


Commodity DRAM latency is mostly array line dominated, not due to proximity/distance, see eg https://ieeexplore.ieee.org/document/6522354

Also eDRAM is difficult to scale


What is the state of adding higher-level logic to the DRAM chips? E.g. clear a whole DRAM row to 0 (blank memory for new processes). With today's processes, we can probably fit whole simple processor cores there easily.


This is called PIM (processing in memory) and UPMEM [0] has a working implementation.

[0] https://www.anandtech.com/show/14750/hot-chips-31-analysis-i...


The latest DDR standard, and supposedly HBM3 (still in the making), support range requests.


I guess this would expose new risks of timing side channel attacks, though.


Don't run untrusted code


The same process tech is used on POWER9, and a lot of the physical design is shared between z and p, so this eDRAM is pretty mainstream in POWER.

OpenCAPI seems like it will provide latency almost as low as an on-die controller with DDR4-attached RAM, at much cheaper prices than contemporary designs, and it opens the door to third-party innovation.

The real issue is trading latency versus bandwidth. IBM's scale-up designs have had ridiculously large memory bandwidth forever, but latency suffers to some extent with buffers like the SCs in the article. Whether that matters for your workload really depends. But people need to break out of the locks-and-latches mindset, use safe memory reclamation techniques like RCU and EBR to avoid common critical sections for lifecycle management, and in general minimize synchronization to make full use of current designs, even ones like the now-common AMD Rome.

I firmly believe we've been in a time where better programmers aware of hardware/software interface have been needed for a while, rather than some kind of hardware physics brick wall as the pundits often claim.
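To make the RCU suggestion concrete, here is a minimal read-mostly sketch using userspace RCU (liburcu); the config structure and function names are just illustrative, and real code also needs per-thread registration:

    /* Sketch of an RCU-protected read-mostly pointer with liburcu.
     * Readers take no lock and bounce no cache line; the writer publishes
     * a new version and waits a grace period before freeing the old one.
     * Each thread must call rcu_register_thread() before its first read. */
    #include <urcu.h>        /* userspace RCU, link with -lurcu */
    #include <stdlib.h>

    struct config { int max_conns; int timeout_ms; };

    static struct config *live_cfg;          /* RCU-protected pointer */

    int read_timeout(void)                   /* hot path */
    {
        rcu_read_lock();
        struct config *c = rcu_dereference(live_cfg);
        int t = c ? c->timeout_ms : 0;
        rcu_read_unlock();
        return t;
    }

    void update_config(int max_conns, int timeout_ms)   /* rare slow path */
    {
        struct config *nc = malloc(sizeof *nc);
        if (!nc) return;
        nc->max_conns = max_conns;
        nc->timeout_ms = timeout_ms;

        struct config *old = rcu_xchg_pointer(&live_cfg, nc);
        synchronize_rcu();                   /* wait until no reader can still see 'old' */
        free(old);
    }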


Quoting: 'The memory bottleneck'

(Raises the head from her electric soldering iron:) Maybe "in a world without any 'North- or Southbridge'-bottleneck" ?

Or have you actually figured out how to attach a keyboard directly to your graphics card (still powered by a PSU), cos... your graphics board has (plain and simply) CPU power, RAM and for sure 'ports' to attach something... ?! (-;


no


Oh no, the packaging connections are (guessing) thousands of nanometres. Much easier to build, much cheaper. Also, the process of moving to multi-chip packages means the process yield for the chiplets is much higher. It's a bloody good idea :)


Intel laptop parts had 128MB of eDRAM starting in 2013. Is that mainstream enough for you?


Had. And then floated away down the stream, never to be seen again, at least not with 128MB.


At what frequency did that eDRAM work?

If the effective frequency was 100 MHz, getting hit by cache misses would indeed be "only" 2-3 times less bad than a full memory round trip.


If they had ever made it available at the mid-high end or released it outside of a couple weird SKUs before they quietly buried it, then it would count.


Twelve point two billion transistors. That's absolutely nuts. Does anyone have a ballpark figure for how much a 'drawer' of four of these things costs? What's it supposed to run, is this an Oracle/DB2 beast?


There is a version of DB2 for mainframe, but it’s a totally different codebase as I understand it. The operating system on a mainframe provides a lot of what modern app developers get from their database and caching systems, generally with better fault tolerance and availability. So if you have an app built for mainframe, it often will not have an external database dependency. Doing transactions can just look like writing to memory or files.


When I was running Z for Linux, the cost was on the order of six figures per processor.


Highly dense Docker hypervisor hypervisor hypervisor.


Nobody's even mentioned Crysis.


> is this an Oracle/DB2 beast?

My impression is that RDBMS workloads are generally far more IO-intensive than CPU-intensive.

Looking at my own Azure SQL CPU vs. IO vs. log charts right now, even with some CPU-heavy OLAP queries it barely passes 25% CPU in the 2-vCore database used in my current main project.


Wouldn't a DB benefit enormously from faster memory access? Perhaps you're only getting 25% CPU usage because the CPU is stalled waiting for main memory to reply...


An EPYC processor with 64 cores contains 40 billion transistors spread over 8+1 dies.


The cost will decrease over time, just like all tech.


Impressive tech. How big is the market for these machines these days? Like how many Z15 CPs would they expect to sell (assuming each Z15 install can vary a lot in size).


This is likely catering specifically to IBM's customers who have been with them since the mainframe days and continue to rely on IBM products (airlines, banking, etc.). The systems used by these orgs are massive in complexity, and I'm not sure how much they want to invest in refactoring them to run on COTS hardware... it probably doesn't make sense for them financially.


There is a market for reliability, security, and scale in a compact size, so they can get new customers.

Companies like Robinhood may discover that it's actually cheaper to buy reliable and secure hardware and write software for it than to try to write fault-tolerant and secure software on top of COTS hardware.


Is it really that easy to write reliable software for mainframes? What sort of abstraction do you get from them that you don't get in commodity hardware?

Nice easy shot at Robinhood of course, but they're a very young company. Do the big banks really have a fundamental edge that's not just more invested hours?


I think the idea is that your locks and threads are so close latency wise that you only need a few to get the job done. Versus trying to figure out how to parallelize things, some of which are inherently non-parallelizable. But sometimes you don't know until you try! Ain't life grand?


IBM has started to market what is essentially z with only IFL CPs as a kind of k8s-in-a-box, so they are obviously trying to expand into lower-tier markets.


I suppose there is a world market for maybe five of those…


I understood that reference.


What, 'five eyes' or?


It's referencing former IBM chairman Thomas J. Watson's alleged quote:

https://en.wikipedia.org/wiki/Thomas_J._Watson#Famous_attrib...


Much less ominous, thanks! :-)


No one will ever need one of these in their home


Same question, and I wonder how much they cost, too.


This is my type of tech. Not glamorous but highly functional.


Are there any modern initiatives around mainframe? What is IBM doing to promote it more to new generations?

Are there any startups doing anything related to the mainframe?

I was always fascinated by the tech around mainframes and am even thinking about moving into that space. I can imagine that the barrier to entry is high... or is it?


Mainframes are interesting to me (but I think a lot of things are interesting) :)

I found this one site with links to books on mainframes: http://www.mainframes.com/Books.html

If IBM is investing in 14nm tech for mainframes, it's not dead tech. A quick search revealed the following:

""" 70% of the world's production data, and 55% of world's enterprise transactions, took place on mainframes (2016) """

http://ibmmainframes.com/wiki/who-uses-mainframes.html


For an individual, yes - the barrier to entry is high. But I don't really see any way to change that, due to the very nature of mainframes: systems of complex and highly tuned special-purpose submodules. This is the first thing you'll notice when digging through the literature: lots of brand-new, non-standardized acronyms for purpose-built subsystems. While a lot of it certainly has the taste of needless market segmentation, there is a lot of unique stuff that is genuinely scarce. Yes, you can get a second-hand mainframe at a bargain price - but you really don't want to unless it is part of a much larger tax-advantaged living-museum project. There are emulators though, Hercules and zPDT.

My interests led me more to the Power architecture, which is weird enough to hold my interest - while still being practical at the individual scale. For example, the z15 comes with the NXU compression accelerator - the p9 has a similar (same?) NX coprocessor. You'll also find market segmentation here, and IBM would do well to knock it off with the weird PowerVM/NV/Opal/ePAR/LPAR stuff. Their performance monitoring and scheduling stuff is really awesome, and it is unfortunate that they use it to segment product offerings. It isn't as ugly as Intel's ECC games, but it still isn't a good look.


There's the Open Mainframe project, and Zowe I guess:

https://github.com/zowe

If you're interested, there's also the Master the Mainframe initiative. Mainly a competition for those in school, but the 'learning system' offers year-long free access to those of us to whom education is a distant memory...

https://www.ibm.com/it-infrastructure/z/education/master-the...


The barrier is not always high https://www.youtube.com/watch?v=45X4VP8CGtk


Are you trying to say that it is wide and heavy as well :)

Joke aside, how are people actually getting started in this space? And more importantly, is it "worth it" financially? Is the mainframe skill shortage ("dreaded COBOL"?) a real thing?


> How are people actually getting started in this space?

Depends on your needs. Do you need near-perfect uptime? Most don't; their products aren't so critical, so they settle for a cheaper option. Do you want to run a data center? Most don't; it's so much easier to push code to AWS and let them handle the infrastructure demands.

Startups historically survive because of adaptation and speed. The mainframe may not fit those operational principles, though that's up to a given startup (see below for industries which might benefit the most from a mainframe).

> Is it "worth it" financially?

The System Z series excels at transactions and updating records. Industries which deal extensively with these types of computations are banks, airlines, credit card companies, stock brokers, insurance companies, and certainly others. If you're in one of those lines of business, you probably should look into mainframes. More generally, if you need to maintain system state at all costs, mainframes are probably a good option.

I would not recommend mainframes for heavy, laborious computing loads (scientific computing, rendering boxes, etc.)

> Is the mainframe skill shortage ("dreaded COBOL"?) a real thing?

If a company wants to add a new feature to or fix a bug in a 40-year-old COBOL program, they'll likely have a hard time finding a young programmer, sure. Some older COBOL coders are helping fill the gap while they can. Don't forget that the System Z mainframes have a level of backwards compatibility that makes x86 blush; your COBOL program will certainly still run.

I wager most new programs (<15 years) have been written with Java or C++, given z/OS supports more languages than COBOL: https://www.ibm.com/support/knowledgecenter/zosbasics/com.ib...

COBOL is dying, as it would be ridiculous to start a new project in COBOL. But many legacy systems still work, so why change them if they aren't broke?

I've made a comment in the past about IBM mainframes which you also might find informative: https://news.ycombinator.com/item?id=20978305


Thank you for the answer and sorry for not being clearer, but I was asking from the perspective of an engineer that wants to enter the mainframe world.

Let's say I want to focus on developing, maintaining and running the software on mainframes. How do I get "in" and are the skills paid well in comparison to a typical Java/.NET/C++ developer position nowadays?

You touched a bit on dying workforce when you mentioned COBOL. Is that dying workforce a real problem making it financially lucrative for the people willing to learn that stuff? Or is it just a myth?


IBM has an internal program for training people in Mainframe skills, but for the life of me I can't remember what it is called. I was going to be part of it when I graduated college, but then 08 happened and I was moved into Open Systems support after a RIF.

A google search showed me this page. https://www.ibm.com/case-studies/ibm-academic-initiative-sys...


> Let's say I want to focus on developing, maintaining and running the software on mainframes. How do I get "in" and are the skills paid well in comparison to a typical Java/.NET/C++ developer position nowadays?

IBM has quite a few training and certification programs available [1][2][3][4]. (Number four is quite interesting: a yearly competition designed to teach mainframe skills.)

From my understanding, Java development on the mainframe isn't significantly different from standard programming. Much of the heavy lifting happens in the background, and the programmer's focus then becomes learning the ins and outs.

As for compensation:

https://www.glassdoor.com/Salaries/mainframe-developer-salar...

https://www.glassdoor.com/Salaries/software-developer-salary...

Mainframe dev: $74.9k. Software dev: $76.5k. Seems fairly equivalent.

> You touched a bit on dying workforce when you mentioned COBOL. Is that dying workforce a real problem making it financially lucrative for the people willing to learn that stuff? Or is it just a myth?

All good myths rely on elements of truth. :)

Will there be COBOL positions? Sure, for at least 10-20 years. But how do you want to set yourself up for additional growth?

If you invest time to learn traditional blacksmithing techniques, you might find a job at an interactive museum or on some specialized YouTube channel. Nothing wrong with that. Can't say there's lots of growth in that industry, though. The time and effort invested is about preserving the methods of the past. So, are you looking to preserve or create? Either option is perfectly fine, but each presents tradeoffs.

----

Sources:

[1] https://www.ibm.com/certify/

[2] https://www.ibm.com/case-studies/ibm-academic-initiative-sys...

[3] https://www.ibm.com/it-infrastructure/z/education

[4] https://masterthemainframe.com/



