What every programmer should know about memory, Part 1 (lwn.net)
236 points by Anon84 on May 2, 2012 | 79 comments



If you want to read something more approachable that can serve as a gateway to this, I highly recommend Jon Stokes' Inside the Machine. It goes over microprocessor architecture in a very approachable way, and contains a good chapter on the memory/cache hierarchy that is sort of the light version of a good chunk of Drepper's paper, or at least will be a big help in making sense of it. If you've never been close to the metal, or you want to catch up on newer developments, consider checking it out.

Also, here's Drepper's paper compiled into a single PDF, which I think was edited a little later than the LWN publication and might have had some minor errata fixes: http://www.akkadia.org/drepper/cpumemory.pdf


As much as I liked Stokes's articles in Ars, I think programmers should probably skip Inside the Machine and go straight to Hennessy and Patterson. Inside the Machine is loaded with analogies that I find annoyingly distracting; I'd rather learn how a computer actually works than learn an analogy about how it works.


Very useful information, but this is way overkill for what "every programmer should know".

Every programmer should have a basic understanding of latencies.

This is slightly out of date but still gives an idea of relative latencies: http://norvig.com/21-days.html#answers
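
Roughly, the kind of numbers that page lists (give or take; they drift from machine to machine, and the ratios are the point more than the absolute values):

    fetch from L1 cache                    ~0.5 ns
    branch misprediction                   ~5 ns
    fetch from L2 cache                    ~7 ns
    fetch from main memory                 ~100 ns
    read 1 MB sequentially from memory     ~250,000 ns
    disk seek                              ~10,000,000 ns
    read 1 MB sequentially from disk       ~20,000,000 ns
    round trip US <-> Europe               ~150,000,000 ns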


Exactly. This is a significant portion of a graduate-level computer architecture course. Your average Rails/Node/etc. programmer doesn't need to know half of what's covered during the 3 weeks of an undergraduate architecture course that's devoted to the memory system.

How many programmers in 2012 are ever even going to need to cache-optimize their code? 25%? 10%? The content is fascinating, but the title is pretty wrong.


I hear this so much, but it's just wrong. Here's an "interview style" question which is hopefully practical enough to show why:

Let's say you have a giant in-memory balanced binary tree (for your data store, maybe) with mean depth of ... 27, say. And you're deployed on a machine with 4 DRAM channels populated with (for the sake of argument) 7-7-7-23 DDR3 memory. Give me a ballpark estimate for how many reads per second you can achieve from that tree.

My experience is that virtually no one can answer this question well, and most programmers' intuition is wildly off (they think it's much faster than it is). And even the ones who get it mostly right usually just use "100 cycles" as their assumption of DRAM latency and get tripped up on the subtleties (like the interaction with the number of channels vs. precharge overlap; or the already-active row latency for sequential access which this question doesn't cover).

Reading this paper will teach you how to answer that question from first principles. Surely that's worth a few hours of your time. At the very least it's worth more than arguing on the internet about how you don't need to know it.

(edit: A bonus followup question -- hopefully just as practical -- would be: "So maybe binary trees aren't great. Given a node data size of N, what is the most appropriate fanout to use in a B tree implementation?". This pulls knowledge of caching architectures into the mix.)
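
(For concreteness, here's the shape of a back-of-the-envelope answer. The 75 ns per miss is an assumed end-to-end figure for illustration, not something read off a 7-7-7-23 datasheet; the paper shows how to derive the real number.)

    /* Ballpark sketch, not a benchmark: assume every pointer chase below
       the first few levels misses all caches and pays a full DRAM access,
       called ~75 ns end to end here (an assumption, not a measured value). */
    #include <stdio.h>

    int main(void)
    {
        double ns_per_miss   = 75.0;  /* assumed DRAM read latency          */
        double depth         = 27.0;  /* mean tree depth from the question  */
        double ns_per_lookup = ns_per_miss * depth;   /* one dependent chain */

        printf("~%.0f ns per lookup => ~%.0f lookups/sec per thread\n",
               ns_per_lookup, 1e9 / ns_per_lookup);

        /* Independent lookups can overlap across the 4 channels, so total
           throughput can be several times higher, but each single lookup
           is still a serial chain of ~27 misses.  For the bonus question:
           size a B-tree node at one or a few cache lines (fanout roughly
           cache line size / key size) so each level costs one miss. */
        return 0;
    }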


Sometimes the job of an interview question is to make it clear to the interviewee that this is not, after all, somewhere they want to work.

That would be my reaction to those interview questions.

As for the follow-up question, my response would be to ask why you are optimizing your B tree for a specific architecture instead of using a cache-oblivious B tree. (Yes, I can think of several reasons why not to. But if the interviewer is asking me BS questions, I want to know whether their knowledge is similarly good.)

Of course in practice I'm generally happy to use BerkeleyDB - the B tree has never been a performance bottleneck for me except when it involved disk latency. And at that point a more fundamental reorganization of algorithms becomes necessary.


Cache wasn't mentioned in the question... there might be no cache at all. ;-)

Cache oblivious B-trees aren't exactly oblivious to how DRAM behaves, and sometimes their operation actually can be counter to what works best in DRAM (though I agree that generally they tend to do better than data structures that are "oblivious" in a very different sense).

But let's ignore that for a moment. Okay, you aren't using BerkeleyDB, MySQL, or any other of the common B-trees out there, but instead you've got your hands on a cache oblivious b-tree. What is the approximate optimal fanout rate and what is the approximate number of reads per second you can achieve with the 7-7-7-23 DDR3 memory (or memory of your choice that is commonly found in modern systems)?

The question isn't really about knowing the specifics of a particular memory infrastructure. It's about understanding the cost of going to memory, which is increasingly as important as understanding disk latency (you know how "disk is the new tape"? well, RAM is the new "disk").

Interestingly, a lot of databases these days are CPU bound, particularly when working with big data, but even in other cases. They aren't really CPU bound; it just looks like that because of how accounting is done. When you poke under the covers, you see the CPU pipelines are comparatively idle because the CPUs are all tied up accessing memory.

Particularly if you are using AWS or other virtualization services, most web applications these days rely on servicing a majority of their requests entirely from memory. Being able to understand the performance characteristics of memory at least within an order of magnitude becomes increasingly important, because the average rate that you can service from memory becomes a critical metric for the scaling of your services.

It's more than even that though. As concurrency becomes increasingly important, locks and atomics become central to a lot of work. The last few iterations of Moore's law have changed the principles of how locks & CAS impact performance. It used to be that locks were bad, because you often burned up CPU going down to the kernel, but more than that because suddenly you are wasting CPU while you wait for something else to complete. CAS allowed you to largely sidestep all that CPU wasting and you're a happy man. Then CAS got integrated into locks so that locks only really sucked when you had contention, but now... The really expensive cost of a lock these days is the memory fence put around it. CAS is better because the fence is scoped to only the object being swapped, but even that can be horribly expensive if there is contention on the object. Having a clear understanding of just how big that cost is relative to burning some more CPU, wasting some RAM, or even changing the constraints for your code, can be very important.
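
To make that concrete, this is roughly what the CAS pattern looks like (a minimal C11 sketch; the names are made up, and the interesting cost is the ordering implied by the atomic read-modify-write, not the arithmetic around it):

    /* Minimal C11 sketch of the CAS pattern described above.  The atomic
       read-modify-write is where the ordering cost lives; under contention
       each retry also bounces the cache line between cores. */
    #include <stdatomic.h>

    static _Atomic long counter;

    void add(long delta)
    {
        long seen = atomic_load_explicit(&counter, memory_order_relaxed);
        while (!atomic_compare_exchange_weak_explicit(
                   &counter, &seen, seen + delta,
                   memory_order_acq_rel, memory_order_relaxed)) {
            /* on failure 'seen' now holds the value that beat us; retry */
        }
    }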

Trust me. This stuff really matters for any context where performance matters, and it is only going to get more important in the coming years. Memory has pretty much already trumped CPU in terms of its importance for most applications, and the number of cases for which it is more important than disk is growing at a rapid rate.


I believe you.

However I've spent most of my life working within scripting languages. And the same is probably true of most programmers in the startup world.

If you're able to get away with using a scripting language, then your problems had better not be performance critical, because your swiss army knife of choice really sucks as a saw...


Yeah, web apps mostly get to ignore stuff like this until they're really big. But then sometimes you're a startup like Akiban Technologies and this is what you do.


First, the notion that using a scripting language means that performance isn't an issue is a canard. It ought to mean your revenue isn't directly correlated to performance, but that's about it. Indeed, if you are using a scripting language, odds are it's more important that you understand the trade-offs, as your lack of low-level options means you need to make smart choices about high-level options.

Second, very few startups don't end up making decisions about performance even in the early days. What? You're using a NoSQL store? Is that because you just hate yourself or because you thought it'd be more efficient/scale better than an RDBMS? Why aren't you just storing it all in memory? Wait, you can run out of that? just let it swap then! Actually, why not just store all the data in a flat file? You don't even need to sort it, you can just scan through it whenever you need to, because performance doesn't matter. Why are you using an associative array there? Do you care about having efficient lookups based on a key or something?

You see my point.

I get that some people are playing around in a segment of startup land where programming isn't really that important, but presumably your long term plan isn't for every startup you participate in to never go anywhere. What if you are a success? Do you still want to be on the technical side when you've now got a team of developers and a serious customer base? If not, that's fine and likely a great career plan, but you aren't a programmer in the professional sense (which is what the paper is referring to). If you are planning on staying on that side, either learn this stuff or expect to be cast aside and not grow with the organization, because you aren't ready for the bigger world you'd now be playing in.

To be clear here, I'm not suggesting you'd have to pick up a systems programming language. Heck, when I started my career two decades ago, when CPUs were a lot slower and interpreters weren't as efficient as they are now, there was already this notion that "oh, I'm working in an interpreted language which provides a layer of abstraction so I don't need to know how computers work". I made a good start of my career (and met a lot of people who had been doing so for quite some time) essentially kicking those guys out of their jobs, because it turned out doing scripting with an understanding of the computing consequences of your choices is applicable in a very broad set of situations, but scripting without that understanding is barely more useful than knowing how to use Excel.

High level languages actually can kick quite a bit of butt in performance critical situations, because they let you focus on the bigger picture of how you are executing, where the bigger performance wins can be had (I've seen "scripting languages" winning performance bake-offs for exactly this reason), but you have to know what the heck you are doing. People often make the mistake of thinking this won't matter. They tend to have very short careers.


Within the limitations of a scripting language, of course you try to get the best performance that you can.

However in general within a scripting language you are not trying to optimize things at the CPU cache level. You're worried about overall efficiency of your algorithms, avoiding disk, identifying performance bottlenecks, etc. But the pervasive use of hashes everywhere and memory hungry data structures makes careful use of CPU caches pretty much a lost cause.

Furthermore the language does so much behind your back that an analysis of performance from first principles is very unlikely to be right. Instead you need to benchmark.


That's a ridiculously contrived example; in the real world, binary trees are generally terrible data structures. But for most programmers, most of the time, even knowing what data structure is being used ahead of time is a waste of time.

PS: Wait, so you're hand-rolling your own caching system. That's nice, you're fired.


OK, let's roll with that: you have a better idea for an in-memory data structure with better performance than an RB- or AVL-tree? How much faster? How many DRAM cycles does it save?

You can't defend your point without understanding how the hardware works, you know. Either it's a "terrible data structure" for DRAM stores (and you know why) or "this is all a waste of time" and it's not a terrible data structure by definition. Which is it?


A hybrid data structure. How many DRAM cycles does it save? Depends on your data and access patterns; 10-20 is reasonable, but it also saves processing time etc.

EX: Let's say you want to store 1/2 to 100 million GUIDs. They should have a random distribution, so you don't have to go crazy: just start with a lookup table for the first 16 bits (sized to fit into L2 cache) and a radix tree on the leaf nodes. http://en.wikipedia.org/wiki/Radix_tree or an AVL-tree or whatever you want; it's just not that important, because these small trees are unlikely to be cached. You may want to use a mid-range of lookup tables to cut down on DRAM round trips, but again that's the type of micro-optimization that starts to get really specific to your usage patterns and the amount of memory on your system etc.
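
A rough sketch of that shape (hypothetical names; subtree_find() stands in for whatever per-bucket structure you pick):

    /* Hypothetical sketch of the hybrid above: a flat 2^16-entry table
       indexed by the first 16 bits of the GUID, each slot pointing at a
       small per-bucket tree.  subtree_find() stands in for whatever that
       per-bucket structure's lookup happens to be. */
    #include <stdint.h>
    #include <stddef.h>

    struct node;                                   /* radix/AVL/whatever  */
    struct node *subtree_find(struct node *t, const uint8_t guid[16]);

    struct guid_index {
        struct node *bucket[1 << 16];              /* 512 KB of pointers  */
    };

    struct node *guid_lookup(struct guid_index *idx, const uint8_t guid[16])
    {
        uint16_t top = (uint16_t)((guid[0] << 8) | guid[1]); /* first 16 bits */
        struct node *t = idx->bucket[top];         /* one predictable load    */
        return t ? subtree_find(t, guid) : NULL;   /* then a shallow walk     */
    }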

But wait, what if you're using timestamps? The first N digits are going to be the same, right? Again, knowing the DRAM timings is pointless: verify it's in the range you want, mask some middle bits, use that as an index... etc. Or, if it's a pure lookup and it's OK to give up ordering, you can flip the bit order to get a fairly random distribution. (Again a micro-optimization based on data usage patterns, but it may help out.)

And here is the thing: if you're storing the data on an HDD, the same idea is still useful, just use a larger index in RAM. The important things are relative latencies and storage space, not specific timings.

PS: Still, this assumes speed is really important. Generally speaking there are far more important things to worry about, like keeping all your data in RAM in the first place; space efficiency can be more important than access patterns. And you might be better off creating a batch of queries and ordering them to be more efficient vs. changing how your data is stored.


"you have a better idea for an in-memory data structure"

The problem with this question (and your original example) is that it's entirely contrived. A better in-memory data structure for what? String indexes? Integer keys? Complex objects? Metric data? Region data? What is your query load? Range queries? Exact queries? Substring queries? K-edit queries? Nearest neighbors? etc....

Before worrying about DRAM performance on the particular machine you may be running your code on for a year or two in production, you need to answer those questions. The performance of your system will depend far more on your understanding of the workload (read vs. write), object types and query types than on the data structure's understanding of the DRAM. That is the last optimisation you make, after you have done everything else, because it is machine specific.


That logic makes no sense to me. By that definition any focused question is "entirely contrived". So I'll throw it out to you: pick one. Come up with any large in-memory data store (any structure, any stored data, any indexing) and give me an answer about its performance that doesn't require knowledge of how DRAM works.

(Edit: signing off this ridiculous flame war. We're now in full circular logic mode: examples showing the utility of DRAM knowledge are contrived because DRAM knowledge is never useful. Sigh. I'll just point out that, because I know how this stuff works, you are wrong. You can do your jobs without being a good hacker, but you'll never be a great hacker if you don't know how your machinery works.)


To a first approximation, all focused questions that rely on knowing DRAM timing details are entirely contrived. That's because almost no programmers need to focus on performance on hardware that specific, outside of people targeting mass produced identical hardware, like consoles.

Trying to predict performance from first principles is a mug's game in any case; there will be things you didn't expect. It's far better to measure.


Wouldn't you rather have at least an approximate map showing the more swampy areas on your random walk up Performance Mountain (or is it down into Performance Valley? The sight of all these top-rooted trees is confusing me)?


I'm criticizing basing your architectural decisions on things as low-level as specific DRAM timings, unless perhaps you're targeting a very focused, extremely mass market niche, like a console.

I never said that one must be ignorant of factors that can affect performance when measuring. Rather, don't try and predict from first principles without measuring.


So, by this logic, when writing code for a variety of CPU/memory/storage/GPU platforms, there really isn't much point in worrying about performance at all?

Good luck with that interview at Oracle/Google/Facebook/Twitter/Activision/Adobe/Microsoft/AnyoneWhoKnowsAnythingAboutSoftware....


I never said there's not much point worrying about performance. What I said is that working from first principles at the level of detail of DRAM refresh timings is likely to be less productive than measuring different approaches based on higher-level understanding, because there will likely be more low-level details than you expected. But with higher level stuff, like the basics of CPU caching / associativity / risks of cache sloshing etc., you can make some reasonable design decisions - as long as you measure and experiment, continuously, throughout the process.

The window within which caching effects occur varies from machine to machine - specifically, certain access patterns fall off a cliff once you go beyond the window - but with appropriate design, you can take reliable advantage of caching behaviour without needing to target a specific cache size.


I don't think the question was looking for a precise answer, but rather an approximate one. As he said, most programmers don't even get in the ballpark, and you'll note his question didn't mention cache or cache size. In fact, the way the question is phrased, you could answer for a system that had no cache whatsoever if you wanted to, though I'd go with some common cache architecture.


When I run across a problem like the above, surely it's worth a few hours of my time, at that time. If it becomes a habitual problem, I no longer have to spend a few hours of my time.

I think I did - 8 years ago. The problem isn't that I'm lazy, it's that if I have no reason to use something for a long enough period of time, I tend to forget it. Maybe you remember everything you read forever?

There's already a plethora of information that is worth a few hours of my time - the problem is there aren't enough hours in the day. I don't want to spend them on things I'll not use and soon forget.


I certainly don't remember things forever. But I generally find that the hard concepts stick around vestigially. Having read this (admittedly weighty) paper once, you'll find that going back in 4 years to check something isn't nearly so bad.

Really, read it. It's good for you. Like vegetables.


No. I already have a ton of other vegetables on my plate. Lambda calculus, compiler architecture, Haskell arrows.


Other things like cache coherency, number and type of cores, other things running on the system, compression or reduction based on properties of the data all have an effect on what is possible. Not to mention weird things like some specialised FPGA, GPU, network card, or even SSD that might have DMA'd the memory. Various bus types depending on the CPUs can also have a massive effect - not to mention memory access patterns your code might use. What about the OS level abilities that you're allowed to use? Is ring 0, or setting the affinity of processes, or enabling Turbo mode allowed? Maybe you don't give a shit about the data structure at all, and just searching for a word with a hyper-optimised grep will win. Then taking advantage of that, you can cheat and get something even faster. As you can see, even getting into the level of detail you mention, there is still a shit-load of extra details that might be needed to take full advantage of the hardware.

</arguing>


   Surely that's worth a few hours of your time. At the very
   least it's worth more than arguing on the internet about 
   how you don't need to know it.
That's a great point.

Regarding your interview question, the kind of optimization you're getting into isn't something 99% of 'programmers' will ever touch.


I suspect you can spin the demographics in lots of directions. But I'd hope that a far higher percentage of startup technical founders (the audience to which the comment was directed) would need to do non-trivial performance estimation in the course of building their products.


I think the point is that they don't.

Tell me any of the sites you visit in which this knowledge would be critical to deploying whatever service/product they're offering? I would imagine it would be pretty close to 0.


Google. Amazon. Bing. Twitter. Facebook.

When you start hitting massive amounts of traffic, these kinds of optimizations can save you a TON of money in equipment and bandwidth.


Facebook was implemented in PHP, and still is (though now compiled, whatever). Google was, at a time, implemented in Python. Twitter was built on Rails, probably even ran on MRI. Maybe they switched to Java, who knows. Hardly hardcore-optimizable languages.

Not to mention that premature optimization may be bad, but even when you need it, it's better to just write a few algorithms and test them against real data.


The point isn't so much to think about optimizing assembler, but rather making higher-level algorithmic choices that mesh with your platform. Even if you don't sift through the Ruby bytecode, you can still apply some knowledge about what kind of data structure will be better in which kind of situation.

For example, if posed with the question presented here, and you realize it will not be as fast as you might hope, maybe you can make the choice to build a data structure with much less depth (maybe you can fit it all in a hash table?) and achieve huge performance wins without once touching assembler. (With a 27-deep tree, you need to make 27 reads from memory to find your data; reducing the depth reduces the required reads.)

(I am not a computer scientist, and I am not an expert in algorithms, but hopefully I'm not too far off here.)


As far as I know, Google was never implemented in Python. However Python is one of the three major languages at Google. (Mostly used for non-performance critical scripting stuff. If they care about performance it is in C++ or Java.)

The others are correct.


Last I checked, PHP, Python and Ruby were implemented on computers with DRAM memories. All that analysis applies to them too. "Following a reference" is a memory load (or several) in all languages.


You seriously aren't willing to admit that your question has zero relevance to someone working in PHP?

The entire point of coding in a language like PHP is that you don't think about the low level stuff. And even if you did take the time to truly understand the source of PHP and made some fabulous optimization, they could change implementation details in a future release and break what you did.

Bottom line is that your interview question has almost no correlation with whether someone would be a good PHP developer. You might as well ask what the capital of Macedonia is as that is just as good of a screen for potential applicants.


More likely for people on that path, there's bigger wins from paying down technical debt that exists at a higher level than by trying to optimize your code's memory access patterns.


I agree completely. Frontend optimizations can probably give a lot more in the short term than these kinds of backend optimizations.

Unfortunately, even if it's paying down technical debt, if you're trying to squeeze out more cycles with your low-level code, a lot of the time it's issues on the algorithm level. When refactoring there, these kinds of optimizations should always be at the back of your head, even if it's just to explain why moving the for loop up 5 lines gave the function a 2% performance boost.


That depends. Money spent on hardware buys you linear performance improvements. Algorithmic improvements can buy you exponential performance improvements.

There is most certainly a point where money spent on hardware gives a better return, but all the hardware in the world couldn't net you acceptable runtimes using bubblesort on colossal datasets.


Again, it's a very specialized gig. Each of those companies probably has a small handful of people who work on optimization and could answer the question we're talking about here. But the vast majority of employees in these companies work in higher level languages and never have to think about the specifics of memory management at all.


That interview question would be completely irrelevant for 99.9% of software jobs.

Nobody is denying that this stuff is essential for some very specialized software teams. But the claim that it's something every software engineer needs to know is utterly ridiculous.


> Give me a ballpark estimate for how many reads per second you can achieve from that tree.

Isn't that going to depend on a heck of a lot more than just the RAM latency? Like, for example, how much locality the data has and how big the caches are?

I love this low-level stuff, but I think it's pretty hard to estimate performance of anything from first principles on today's highly complex architectures. There is just too much interplay of confounding factors.


He mentioned the data structure is large, so it's a safe assumption to say the CPU caches won't be useful except maybe for the very top of the tree, and that there is no locality when traversing a tree path.


The title isn't need, it's should. Both are debatable of course, but I'd argue that should is a lot more defensible: Abstractions are leaky, and even if you can get by without knowing what's going on down there, you'll at the very least sharpen your mind and be more surefooted if you do.

What you're saying is a bit like the "but I'll never need math in life" argument. You're going to say that math is a lot more fundamental to existence than computer memory is, of course, and that's true, but if you're looking at just computers, the way the memory system works is a fundamental part of the whole, not entirely unlike math is to understanding nature. If you really want to know what you're dealing with and make sure you're not limiting yourself, you want to be inquisitive and go there.

Basically, I don't really get why the gut reaction to these sort of things (especially if they actually have meat; there's of course a lot of crummy pretenders) is so often to declare them pointless. Yes, you need to filter your information intake, but still, better err on the side of knowing too much.


No one is declaring this information pointless, but it's certainly a lot less relevant to Every Programmer than it was when this was written in 2007. I've actually got an okay understanding of this stuff, but the number of times that I've split

    for(int i = 0; i < 10; i++)
    {
      j += i;
      k += i;
      l += i;
    }
into

    for(int i = 0; i < 10; i++)
      j += i;
    for(int i = 0; i < 10; i++)
      k += i;
    for(int i = 0; i < 10; i++)
      l += i;
to reduce cache misses is very small because with today's hardware, it very often just doesn't matter. Obviously there are many programmers whose primary job is to do this, because they need every clock cycle they can squeeze out of the thing. They're just not representative of Every Programmer.


I'd argue it's more important now than when the white paper was first written. Where processor speed or memory size used to be the bottleneck, increasingly these days it's memory bandwidth instead, so managing that effectively is key to writing high performance programs.


I was going to say something almost precisely like this. The main thing I was going to add is that it is both memory bandwidth and latency that have become important bottlenecks, and often the latter is more important. You know how people were saying "disk is the new tape"? Guess what the new disk is? RAM. Much as with disk latencies, RAM latencies haven't really improved that much over the last decade or so, though memory bandwidth and CPU power have. Guess what that means? ;-)

...but you can only know this if you understand what's going on underneath the hood. So if you don't, no need to worry; you won't even know when the moment of your utter irrelevance arrives.


My typical response to the "I'll never need to know X" argument is similar, which is that having as broad and diverse understanding of a system as possible gives you imagination.

Without imagination, you will not be able to solve certain problems, not know where to even look when some arise, and at worst be totally unaware that better solutions exist.

That said, the OP's context is that of an "interview style" question which raises mental red flags for me. It's a common problem for interviewers to base their assessment of a candidate based on whether or not the interviewee has the same subset of knowledge that they do. In that case, I'd be quick to point out that:

  * knowledge != intelligence
  * knowledge != capability
  * knowledge != skills
  * correlation != causation


They don't need to cache-optimize, but knowing how stuff really works can help prevent some pathological cases of premature optimization in higher-level languages.

I've read most of this during my undergrad OS course, and found it absolutely fascinating.


That's an awesome point I really wish I had made myself, so I'll just shamelessly rehash it to drive it home: Knowing how the system actually works down to a low level can help you avoid inadvertently writing code that will perform badly, even if you're dealing with code that isn't otherwise required to perform above "reasonable" and thus doesn't call for low-level tuning.

Let's take database queries as an example: Knowing something about how the database is implemented will help you avoid writing queries that are pathologically slow, even if the queries you need to write aren't critical path enough to benefit from going to great lengths to tune them to the implementation.


> Every programmer should have a basic understanding of latencies.

Why exactly?

Aren't you making the same mistake of the author and confusing "every programmer" with "every low level programmer whose work is highly dependent on speed optimization"? Which, from my understanding are the minority of us. I don't even know how the APIs I use are implemented, let alone know how much "fetching from L2 cache" they're making.

But of course, I know very little about hardware. So I could be very mistaken. I would love to be proven wrong.


There are very few software environments where the cost difference of "Read 1MB sequentially from disk" versus "Read 1MB sequentially from memory" is irrelevant.


If you're having problems with page faulting, you really just need to use less memory, not necessarily use it more intelligently. If you're programming so poorly that you're blowing through all the available system memory, the problem is likely a little more simple than you not knowing how paging, TLB, etc. work.

Either way, since the disk is so slow, virtual memory and paging are handled by the OS, not the hardware; so there's a lot less that the programmer can specifically control once you get that far down the memory hierarchy, as opposed to (say) avoiding cache misses.


Not true!

A programmer who is aware of the latency issues around disks can figure out how to move data out of memory into files, and how to change algorithms to arrange for sequential access to data. One who is not aware has no way to figure out how to stop the month-end data processing run taking five days. (That is a real world example.)


To avoid disk latency, you're moving data out of memory and onto disk? That's essentially a coping mechanism for having stuff taking up space in memory that doesn't need to be in memory. Knowing how and when to do that is useful, but doesn't require a lot of knowledge of the memory hierarchy beyond "the disk is incredibly freaking slower than memory."

edit: and I just realized I'm in the subthread about knowing about latencies being important, not the subthread about knowing the entire contents of the linked paper being important, so yes, I think we're agreeing.


Actually we are disagreeing.

Your claim was that once you're hitting problems with disk latency, there isn't a lot that a programmer can control. My point was that there actually is a ton that the programmer can control at that point.

In fact a programmer working in high level scripting languages benefits more from knowing about disk issues than everything else for the simple reason that they really don't have any control over everything else. Of course you hope that the people implementing the language understand everything else...


No, my point was that if you're running into problems with page faulting, there is less a programmer can do than if they were running into problems with cache misses. Specifically, what can be done is to use less memory, which would be an equally effective strategy for cache misses. The difference is the cache misses can also be avoided by modifying memory access patterns. Such a strategy is far less effective at controlling page faults and as such, the primary strategy a programmer has is to program in a manner that occupies less space in memory. Doing so does not require a detailed and fundamental understanding of the memory hierarchy and only really requires an understanding of the fact that "the disk is incredibly freaking slower than memory."


I don't see the lack of a parallel that you are claiming.

If I take data that was in RAM, explicitly move it into a file and then make sure that I access it sequentially, I've modified my memory access patterns in a way that controls page faults. I'm using the same amount of actual space in memory and on disk, I've just labeled some of it "file" rather than "RAM".

In fact if anything I'd argue it the other way. If you're using a high level scripting language, what I just said about page faults still works. However the language gives you so little control over what happens in memory that you can't control the cache misses.


I guess I'm not entirely sure what you're saying. You're avoiding instances where you're accessing data in such a sporadic way that causes it to repeatedly request data from a page not in memory? Then when that page is brought into memory, it requests data from a different page that is not in memory and so on? The principle of locality states that you're going to have to try pretty hard to make your memory accesses that far apart for it to ever happen. If that's what you're saying, then I'll concede that in this unlikely circumstance, memory access patterns can help; but really, it's more that your unusual access patterns were hurting you in the first place than that you're fixing a typical problem by tuning your access patterns.

The only other reason you're going to be having page faults is because you're trying to work with more data than can fit in memory. In this case, you will inevitably page fault. The only way to avoid that is to use less memory (or install more memory). So much less that all your data can fit in main memory at a given time.


You keep saying "the only way" when there are other ways.

Let me give a concrete example.

You are page faulting because your data is in a large disk-backed hash. By the nature of a hash, you can access any part of it. Convert that hash to a B-tree, and similar requests are more likely to be close in memory. This can be a big win if there is a non-random access pattern (for instance you are more likely to look up recent records than old ones). Performance may magically get much better. You haven't changed how much memory you need, your data set, etc. But page faults are reduced.

More drastically, rather than going through one data set doing random lookups in another, you can sort both data sets and then process them in parallel. This is a much, much bigger change in the program. However it can result in many orders of magnitude better performance.

I am drawing these from a concrete example. I was looking at a month-end job that did summary end of month statistics for a busy website. Substituting a B-tree for a hash made it go from 5 days to something like 3 days. Not bad for under 5 minutes work. (I just changed the parameters that opened up the database file to use a different format.) Later I took the time to completely rewrite it, and that version took about an hour. Through this process there were no hardware improvements.
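
(To illustrate the kind of one-parameter change that is, here's a sketch against Berkeley DB's C API, used purely as a stand-in for whatever DBM-style library is actually in play; open_summary_db is a made-up name. The access method is chosen at open time, so hash vs. B-tree is one flag.)

    /* Illustrative sketch only, using Berkeley DB's C API as a stand-in:
       the access method is chosen when the file is opened, so hash ->
       btree is literally one flag.  Error handling omitted. */
    #include <db.h>

    DB *open_summary_db(const char *path)
    {
        DB *dbp = NULL;
        if (db_create(&dbp, NULL, 0) != 0)
            return NULL;
        /* DB_HASH scatters keys across pages; DB_BTREE keeps nearby keys
           on nearby pages, which is what cuts the page-fault rate. */
        if (dbp->open(dbp, NULL, path, NULL, DB_BTREE, DB_CREATE, 0644) != 0) {
            dbp->close(dbp, 0);
            return NULL;
        }
        return dbp;
    }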

So you see, understanding disk performance issues is not useless for programmers. You have options for page faults other than asking for better hardware.


Now it's definitely you who is misunderstanding. The scenario you just explained is a very nice example of the caveat that I allowed for. You've found an uncommon case where you're accessing memory in such a way to defy spatial and temporal locality and are experiencing page faults at a rate greater than those that are due to your data set simply being larger than your memory.

However, provided that your hash is greater than the size of your memory, you are still inevitably going to page fault if you expect to access every element of that hash. In that case, "the only way" to avoid page faults is to either increase the size of your memory or to use less of it.

    understanding disk performance issues is not useless for programmers
You're right; that's a fantastic point. The problem is that no one made that claim.


Go back to http://news.ycombinator.com/item?id=3920319 to see the claim that I disagreed with. There is a lot that can be controlled at the page fault level other than just "use less memory". And having a lot of data in a hash is hardly an uncommon use case if someone's processing, for instance, a log file.


I mean, I know what claim you disagreed with. We just had a conversation spanning 9 hours where we hashed out (heh heh) what I actually meant in that post. I don't want to reset back to that post and start explaining from the beginning because I already typed all those words explaining what I meant.

And I dunno, man, you have big log files. I don't think that processing many-gigabyte log files is what most would call a common case, but I could be totally wrong here.


We use this full document very often when teaching graduate-level courses, and it has always helped the students understand the underlying concepts of data access. Even though for most people a thorough understanding of DRAM refresh latencies is not important, it is still a very, very important read for every programmer.

Why?

The answer is easy: almost everything in modern computers is about locality, spatial and temporal. As soon as the complexity of the programs you write is one level above "Hello World" (or, in Rails-speak, a simple controller method), this becomes important.

It's easy to translate the concept of aligning data in DRAM (because it's faster) into Rails-ish behavior. Assume that you read lots of data from your database and you process it item by item. Inside your loop you perform another fetch from the database, again and again. If you instead join the data (think of it as aligning), you need fewer requests to the database.

Any system architecture that involves handling data will at some point come to the situation where the programmer has to think about ordering instructions, database queries or even attributes inside a C struct (see the discussion about why short Ruby strings are faster than long ones...). Even if you keep nothing else from this document in mind, if you remember that sequential access and exploiting locality will increase performance, then the document made its point.

Take the hours, read the document, and probably forget most of it, but take it as inspiration rather than an optimization guide.


I used to think about locality a fair amount back when I programmed in C.

In Java, I have no idea whether or not the fields in a copybook^H^H^H^H^H^H^H^H bean are going to be anywhere near contiguous, or not. Packing and unpacking stuff in a byte array / string gets tedious.

It's a shame low level integration isn't somehow easier in Java, like it is in C#.


Ahh, the "What every programmer should know about X" articles.

So, Ulrich Drepper thinks everyone should know about the intricacies of DRAM refreshing. Zed Shaw thinks we should all know statistics. Some other schmoe thinks we should know about SEO. Et cetera.

Posts like these should be titled "I worked hard and feel good about myself for knowing X, therefore everyone should know it."

It is blisteringly stupid. Every programmer needs to know WHAT THEY NEED TO KNOW.

Nobody can know everything. Why don't we all just learn what we need, build the things which interest us, and stop telling others what THEY need?

Having said that... interesting stuff. Just needs a title change.


It's on LWN. 90% of programmers who go there really do need to know memory in the detail Drepper talks about. I see your point though - I fully expected the title to be linkbait to a hastily written article.

This article, however, is the real deal. Drepper's site (http://www.akkadia.org/drepper/) won't win any design competitions but is a goldmine of systems programming wisdom.


I'm surprised to see so many people here getting hung up on the title of this paper, of all things, when there's so much good information in it.

The title is just the author's opinion. Deal with it. And then read the paper anyway, because even if you don't use it directly it will help make you a better programmer.


These are old, but great. Drepper's history with glibc is a little checkered, but his whitepapers (there's an equally great one on the NPTL work from back when that was being done in glibc) are gold.

Understanding the DRAM cycle is critical to performance tuning in modern code, and my experience is that virtually no application developers know how it works.


I wonder if PCs will ever have non-linear memory.

To explain what I mean, suppose you have a big 2-dimensional array in memory. As it works now, if you read a cell from memory, the area in memory around it is brought into the cache. Because of the way that arrays are represented in memory, what you get in the cache is just cells from the same row (or column). This means that if you need to access neighboring cells above or below the current cell, you're not making any use of the cache. The way I see it, it might be useful if the system could somehow instruct the memory to bring a different-shaped block into the cache. Maybe there would be a whole part of memory intended for 2-dimensional arrays, where blocks of memory are allocated as squares on the plane (instead of segments on a line). Of course, something like this would require support from the programming language, which would need to put 2-dimensional arrays in that special part of the memory (and allocation will be much harder, now that you need to allocate 2-dimensional chunks).


That kind of caching architecture exists for graphics hardware already. GPUs have modes where the natural breakdown of memory is 2D (basically by interleaving the X and Y bits), and indeed that improves cache locality for 2D accesses. But not by a whole lot.
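
A minimal sketch of that X/Y bit interleaving (Morton or Z-order indexing), just to show the trick; the function names are illustrative:

    /* Minimal sketch of X/Y bit interleaving (Morton / Z-order): nearby
       cells in either dimension map to nearby linear addresses, so a
       cache line holds a small 2D tile instead of a strip of one row. */
    #include <stdint.h>

    static uint32_t spread_bits(uint16_t v)        /* abcd -> 0a0b0c0d */
    {
        uint32_t x = v;
        x = (x | (x << 8)) & 0x00FF00FFu;
        x = (x | (x << 4)) & 0x0F0F0F0Fu;
        x = (x | (x << 2)) & 0x33333333u;
        x = (x | (x << 1)) & 0x55555555u;
        return x;
    }

    static uint32_t morton_index(uint16_t x, uint16_t y)
    {
        return spread_bits(x) | (spread_bits(y) << 1);
    }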

The reason caches come in "lines" that are bigger than the word size of the processor is not an optimization (i.e. they're not deliberately bringing in nearby memory in the hope that it will be useful). The reason is that the logic associated with maintaining the cache (index and tag bits) is expensive, and needs to be repeated for each "thing" you have in the cache. So by making the "things" bigger (e.g. the 512 bit cache line on modern CPUs) you reduce the overhead of the bookkeeping.


"The reason caches come in "lines" that are bigger than the words size of the processor is not an optimization (i.e. they're not deliberately bringing in nearby memory in the hope that it will be useful)."

I do not think that is true. There is a considerable chance that that extra memory will be useful. The simplest example to think of are cache lines that contain program code. The other canonical example is the "for item in array do item = f(x)" loop.


No, there's hardware to do that too (speculative reads), but it's even more complicated. Really the driver of cache line size is simple efficiency. Software has to jump through lots of hoops in practice to try to align accesses to cache lines, and that could be avoided if memory could be efficiently cached in word-sized chunks.


I still doubt that. IMO, cache lines are larger than the largest item a CPU can read because the probability that the extra data will be needed soon is high enough to offset the extra work needed to read that larger cache line and the (few) transistors needed to increase cache line size.

In some sense, large cache lines are just cheap ways to implement speculative reads.

See for example http://dl.acm.org/citation.cfm?id=255272 (pay walled, but the abstract is clear enough):

"A significant observation is that, although increasing line sizes can result in a higher hit ratio, it also considerably increases traffic to main memory, thereby degrading the performance."


Not what you describe, but tries to solve the same problem: PowerPC has instructions that tell the system what memory you are planning to use. See for example https://developer.apple.com/hardwaredrivers/ve/caches.html

There also are CPUs that attempt to detect access patterns without help from the programmer, for example Power: http://chipdesignmag.com/display.php?articleId=3926&issu....

I think Intel CPUs have similar functionality, but these were the ones I found first (using "cache stride hints" as Google query), perhaps because Intel uses different terminology.
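
On the compiler side, GCC exposes a portable version of that hint as __builtin_prefetch. A trivial (and heavily caveated) sketch, with made-up struct and function names; it only pays off when there's enough work per node to hide the fetch:

    /* GCC's portable analogue of those cache-touch hints: start fetching
       the next node while the current one is being processed.  Only helps
       if there's enough work per node to hide the miss. */
    struct item { struct item *next; long payload; };

    long sum_list(const struct item *p)
    {
        long s = 0;
        while (p) {
            __builtin_prefetch(p->next, 0, 1);  /* read, low temporal locality */
            s += p->payload;
            p = p->next;
        }
        return s;
    }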


If every programmer actually had to know the contents of every "what every programmer should know about X" articles before programming then we wouldn't have any programs.


http://www.scribd.com/doc/92111472

I'll just leave this here, PDF version for anyone who doesn't want to mess with the webpage reformatting.


This was posted almost 2 years ago: http://news.ycombinator.com/item?id=1511990


It seems that every day on HN there is some post offering something "every programmer needs to know", yet the post is actually some very in-depth dive into the minutiae of the author's pet subject.

The other day there was some post on front end/client side programming with a crazy amount of things everyone needs to know. Now apparently we also have to know the characteristics of the charging/discharging of capacitors in memory.

All in all, this is WAY too in-depth to have the title it has. The only time you have to know all this stuff as a software engineer is if you are working on the lowest level of something like an OS kernel or a video game engine.


Today, this is what every "low level programmer" should know. If you are using a high level language with built in memory management and garbage collection, a condensed subset of this will suffice as only a fraction of the content will be helpful to you.



