The whole article is based around the idea that the machine is dedicated to one large process, and that the working set fits in RAM. If that is the case, great: turn on boot-time 1GB page preallocation and don't look back.
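For reference, a sketch of what that boot-time preallocation looks like on Linux (the page count of 16 is illustrative, not from the article):

```
# Kernel command line: reserve 16 x 1GB huge pages at boot
default_hugepagesz=1G hugepagesz=1G hugepages=16

# Or at runtime (may fail once physical memory is fragmented):
echo 16 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
```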
If either of those two assumptions is false, then the TLB miss times are swamped by the page-in time. Paging in a 1GB page is _NOT_ a fast operation, especially when only a tiny percentage of the data is going to be touched, or when it's promptly going to be paged out again. If he has a machine with 32GB of RAM, he should retest with a 64GB working set.
The whole article is about optimizing a 1) large-memory 2) single process that 3) dominates the CPU time on one machine where 4) the entire working set fits in RAM. If any of those FOUR conditions are not met, then the conclusions do not apply.
That being said, we have an in-memory custom content DB where this tweak might really make a difference.
You seem to be one of those people who thinks whether you have 1G, 2G, 4G, 8G, 16G, 32G, 64G, or 128G... of RAM, you'll still page out to disk eventually.
How much exactly do you consider to be the minimum 'enough' RAM where you won't need to page out to disk? 128 exabytes?
Paging to/from disk is currently a fundamental part of all the major OSes, and it significantly increases RAM efficiency (even for executable pages). I still see a fair number of server applications oversubscribing RAM, particularly in DB or VM'd environments.
I'm not saying 4K pages are good and 1G ones are bad, but there are a fair number of applications that probably benefit from something in between. 1GB is probably on the extreme side of things.
That doesn't mean you need 50TB of RAM, because 99% of the records could be inactive. Instead you let the hardware page things in, and the portions of the database that are regularly used will stay in RAM, while the rest remains on SSD. Architectures with more page-size choices can actually be a big selling point for non-x86 servers in certain cases.
In the end, retained data is still growing, and a fair number of applications don't fit into mapreduce or other partitioning schemes. So, I would say paging is going to remain useful for some portion of the servers in existence for at least a few more years.
BTW: I recently worked on an application which would pretty much eat as much RAM (enormous hash table) as it was given and still ask for more. In the end shipping a fairly normal machine (32G-64G) with a 4TB PCIe based SSD provided sufficient performance that we didn't need to spend 100x on a machine that could take 4TB of RAM. So there are economic arguments as well.
Can you share any more about your use case here? Intrigued, as a colleague and I have recently been working on highly space-efficient hash maps for a bioinformatics application that it strikes me could be relevant (similar problem -- huge reference set that needs to be accessible with very low access times).
Paging is not fast, period. I think it's a reasonable assumption that the machine is not paging. That said, if you rely on paging, then your comment holds.
Ok, from the comments, it's advocating for 1GB pages at the main memory. Not cache, not disk, not network. Main memory.
To me it looks too big: an entire server will have about 32 pages, and swapping will take ages on 400MB/s disks. Current PCs use too small a page, but 4MB seems a much more realistic number.
The available page sizes are set by the hardware. On x86-64, your current options are 4kB (4-level page tables), 2MB (3-level page tables) and 1GB (2-level page tables).
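Those sizes fall out of the radix-tree structure of the page tables: a 4kB base page leaves 12 offset bits, and each table level resolves 9 more bits (512 entries per table), so stopping the walk one level earlier multiplies the page size by 512. A quick sanity check in Python (the bit counts are the x86-64 architectural ones):

```python
OFFSET_BITS = 12     # 4 kB base page -> 12-bit offset within the page
BITS_PER_LEVEL = 9   # each page-table level indexes 512 entries

def page_size(levels_walked):
    """Page size when the translation stops after `levels_walked` of 4 levels."""
    levels_skipped = 4 - levels_walked
    return 1 << (OFFSET_BITS + BITS_PER_LEVEL * levels_skipped)

print(page_size(4))  # 4096        -> 4 kB (full 4-level walk)
print(page_size(3))  # 2097152     -> 2 MB (3-level walk)
print(page_size(2))  # 1073741824  -> 1 GB (2-level walk)
```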
I'm actually kinda surprised there's no nyud.net-style automatic cache for links posted on Hacker News yet. I'm tempted to buy a cheap used PC and set one up at home... Is there some open source code to implement this aside from curl?
A reverse proxy that did simple transform on the submission URLs? There are mutating reverse proxy guides all over the net thanks to the UK torrent site blockade.
I think it is much simpler. 4 KB pages are small today, and 1 GB is still very large for most processes. 2 MB sounds about right (gut check: how much memory does the average process allocate, and for the small processes, is 2 MB really that much overhead?). Unless the number of TLB entries drops as page size increases, larger pages make sense. It's simple: 1 GB risks wasting memory when the process doesn't need that much. 2 MB is good in 2014.
I think it's a bummer that there isn't some option in between. I could see 64MB pages being really nice.
Ultimately it would be handy to be able to tune page size for the loads that you see. I could see page sizes jumping by 4x or 16x (2 bits or 4 bits) each time being reasonable.
The real issue the author is talking about isn't "how much memory does a process allocate" but rather "how many total pages does the OS have to keep track of and what percentage of those fit in the TLB at any one time?"
If the real issue is simply to minimize the number of pages to track (and thereby maximize TLB hits), then it's very simple: go to 1 GB (or higher!) pages.
This hints that there is a tradeoff happening here. At the highest level, the tradeoff is between having efficient use of memory and TLB hits. Big pages give TLB hits, small pages make efficient use of memory.
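To put rough numbers on that tradeoff (the page counts are exact arithmetic; the TLB entry count is an illustrative figure, not one from the article):

```python
GB = 1024 ** 3

def pages_needed(working_set, page_size):
    # Number of pages the OS (and TLB) must track to map the working set.
    return working_set // page_size

ws = 32 * GB
print(pages_needed(ws, 4 * 1024))       # 8388608 pages at 4 KB
print(pages_needed(ws, 2 * 1024 ** 2))  # 16384 pages at 2 MB
print(pages_needed(ws, 1 * GB))         # 32 pages at 1 GB

# A ~1500-entry TLB (illustrative size) reaches only ~6 MB with 4 KB
# pages, but covers the entire 32 GB working set with 1 GB pages.
print(1500 * 4 * 1024)                  # 6144000 bytes of TLB reach
```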
Since the TLB is in hardware, it is more difficult to have the fine-grained tuning you desire.
PowerPC on Fedora now defaults to 64k pages, I happened to notice the other day, which surprised me. But x86 doesn't have a convenient page size that would work as a default, as 2M is probably a bit big.
Interesting, if one ever needs to boost a memcache/Redis instance, this might actually work.
But what about virtualization? Can I use 1GB pages in a guest OS, or will the host OS still handle everything with 4k pages, nullifying any advantages?
Hardware support for virtualization is actually one of the main things driving huge pages. For a TLB miss in a guest, you end up doing a nested page table walk. This is much more expensive with 4 KB pages (i.e. as opposed to 2 MB).
Short answer: huge pages are a big win for virtualization.
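The cost asymmetry is easy to quantify. With n guest page-table levels and m host levels, a worst-case nested walk touches n·m + n + m memory locations: each of the n guest entries lives at a guest-physical address that itself needs an m-level host walk, plus one final host walk for the data address. A sketch of that standard radix-on-radix formula:

```python
def nested_walk_cost(guest_levels, host_levels):
    """Worst-case memory accesses for a TLB miss under nested paging."""
    # Each guest level: a host walk to locate the entry, plus reading the
    # entry itself. Then one more host walk for the final data address.
    return guest_levels * (host_levels + 1) + host_levels

print(nested_walk_cost(4, 4))  # 24 accesses: 4 KB pages in guest and host
print(nested_walk_cost(3, 3))  # 15 accesses: 2 MB pages in both
```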
Given a background in programming, you can ramp up on this in an afternoon. Here's an intro "crib sheet":
Your application believes that it has all the RAM to itself. This is a lie that the operating system and hardware tell your application, to decouple the physical RAM addresses from the ones your application uses (virtual RAM addresses). Learn more about virtual memory here: http://en.wikipedia.org/wiki/Virtual_memory
In order to keep this mirage working, the computer needs to map from virtual address to physical address. Instead of tracking every single address, it tracks spans of addresses. So, the address your application sees as 0 to 4096 will map to physical address 5000 to 9096. Keeping this map using fixed-size spans keeps the size of the mapping down and the performance fast.
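Because the spans are a fixed power-of-two size, splitting an address into "which span" and "where in the span" is just arithmetic. A small sketch with 4KB spans (the particular page numbers are made up):

```python
PAGE_SIZE = 4096  # 4 KB spans

def split(virtual_addr):
    page_number = virtual_addr // PAGE_SIZE  # which span (looked up in the map)
    offset = virtual_addr % PAGE_SIZE        # position inside the span (kept as-is)
    return page_number, offset

# If the map says virtual page 2 lives at physical page 1000,
# translation is: look up the page, keep the offset.
page, off = split(2 * PAGE_SIZE + 123)
physical = 1000 * PAGE_SIZE + off
print(page, off, physical)  # 2 123 4096123
```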
This article is about using bigger spans (about 1 billion addresses) instead of the standard 4KB. The advantage is that the mapping from virtual to physical is stored in memory as a tree, and bigger spans mean you need fewer nodes in the tree. Fewer nodes means fewer traversals/indirections to find the node you are looking for. Less work means faster performance.
The details about the caching and the TLB entry counts in the processor have to do with how much dedicated space different parts of the CPU have for this mapping information.
The details about offsets, and about changing how the memory was accessed to get positive/negative performance in the 4KB-vs-1GB tradeoff, have to do with whether the mapping information was in the cache or not. It is similar to alignment: http://en.wikipedia.org/wiki/Data_structure_alignment
Finally, in order to use these 1GB mappings instead of 4KB ones, the programmer has to use a special way of allocating memory from the operating system, called mmap: http://en.wikipedia.org/wiki/Mmap
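A minimal sketch of requesting huge pages via anonymous mmap on Linux (the MAP_HUGETLB flag value is the Linux one; the call fails unless huge pages have been reserved beforehand, so this falls back to ordinary 4 KB pages):

```python
import mmap

# Python may not expose this constant; 0x40000 is the Linux MAP_HUGETLB value.
MAP_HUGETLB = getattr(mmap, "MAP_HUGETLB", 0x40000)

def alloc(length):
    """Map `length` bytes, preferring huge pages, falling back to 4 KB pages."""
    base = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS
    try:
        return mmap.mmap(-1, length, flags=base | MAP_HUGETLB), "huge"
    except OSError:
        # No huge pages reserved on this system: use ordinary pages instead.
        return mmap.mmap(-1, length, flags=base), "normal"

buf, kind = alloc(2 * 1024 * 1024)  # one 2 MB page, or 512 normal ones
buf[0] = 1                          # touching the mapping faults the page in
```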
Not too bad, actually. For a moment I was thinking about how it might be useful to embed large datasets directly in the page, but it wouldn't be even remotely worth the sacrifice in usability. Just make one extra HTTP request and give the user a nice spinny icon.