The whole article is based around the idea that the machine is dedicated to one large process, and that the working set fits in RAM. If that is the case, great: turn on boot-time 1GB page preallocation and don't look back.
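For reference, a sketch of what that boot-time preallocation looks like on Linux (the page count of 16 is illustrative, not from the article):

```
# Kernel command line: reserve 16 x 1GB huge pages at boot
default_hugepagesz=1G hugepagesz=1G hugepages=16

# Or at runtime (may fail once physical memory is fragmented):
echo 16 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
```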
If either of those two assumptions is false, then the TLB miss times are swamped by the page-in time. Paging in a 1GB page is _NOT_ a fast operation, especially when only a tiny percentage of the data is going to be touched, or when it's promptly going to be paged out again. If he has a machine with 32GB of RAM, he should retest with a 64GB working set.
The whole article is about optimizing a 1) large-memory 2) single process that 3) dominates the CPU time on one machine where 4) the entire working set fits in RAM. If any of those FOUR conditions are not met, then the conclusions do not apply.
That being said, we have an in-memory custom content DB where this tweak might really make a difference.
You seem to be one of those people who thinks whether you have 1G, 2G, 4G, 8G, 16G, 32G, 64G, or 128G... of RAM, you'll still page out to disk eventually.
How much exactly do you consider to be the minimum 'enough' RAM where you won't need to page out to disk? 128 exabytes?
Paging to/from disk is currently a fundamental part of all the major OSes, and it significantly increases RAM efficiency (even for executable pages). I still see a fair number of server applications oversubscribing RAM, particularly in DB or VM'd environments.
I'm not saying 4K pages are good and 1G ones are bad, but there are a fair number of applications that probably benefit from something in between. 1GB is probably on the extreme side of things.
That doesn't mean you need 50TB of RAM, because 99% of the records could be inactive. Instead you let the hardware page things in, and the portions of the database that are regularly used will stay in RAM, while the rest remains on SSD. Architectures with more page-size choices can actually be a big selling point for non-x86 servers in certain cases.
In the end, retained data is still growing, and a fair number of applications don't fit into mapreduce or other partitioning schemes. So, I would say paging is going to remain useful for some portion of the servers in existence for at least a few more years.
BTW: I recently worked on an application which would pretty much eat as much RAM (enormous hash table) as it was given and still ask for more. In the end shipping a fairly normal machine (32G-64G) with a 4TB PCIe based SSD provided sufficient performance that we didn't need to spend 100x on a machine that could take 4TB of RAM. So there are economic arguments as well.
Can you share any more about your use case here? Intrigued, as a colleague and I have recently been working on highly space-efficient hash maps for a bioinformatics application that it strikes me could be relevant (similar problem -- huge reference set that needs to be accessible with very low access times).
Paging is not fast, period. I think it's a reasonable assumption that the machine is not paging. That said, if you rely on paging, then your comment holds.
Ok, from the comments, it's advocating for 1GB pages at the main memory. Not cache, not disk, not network. Main memory.
To me it looks too big: an entire server will have about 32 pages, and swapping will take ages on 400MB/s disks. Current PCs use too small a page, but 4MB seems a much more realistic number.
The available page sizes are set by the hardware. On x86-64, your current options are 4kB (4-level page tables), 2MB (3-level page tables) and 1GB (2-level page tables).
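Those sizes fall out of the radix-tree structure of the page tables: a 4kB base page leaves 12 offset bits, and each table level resolves 9 more bits (512 entries per table), so stopping the walk one level earlier multiplies the page size by 512. A quick sanity check in Python (the bit counts are the x86-64 architectural ones):

```python
OFFSET_BITS = 12     # 4 kB base page -> 12-bit offset within the page
BITS_PER_LEVEL = 9   # each page-table level indexes 512 entries

def page_size(levels_walked):
    """Page size when the translation stops after `levels_walked` of 4 levels."""
    levels_skipped = 4 - levels_walked
    return 1 << (OFFSET_BITS + BITS_PER_LEVEL * levels_skipped)

print(page_size(4))  # 4096        -> 4 kB (full 4-level walk)
print(page_size(3))  # 2097152     -> 2 MB (3-level walk)
print(page_size(2))  # 1073741824  -> 1 GB (2-level walk)
```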
I'm actually kinda surprised there's no nyud.net-style automatic cache for links posted on Hacker News yet. I'm tempted to buy a cheap used PC and set one up at home... Is there some open source code to implement this aside from curl?
A reverse proxy that did simple transform on the submission URLs? There are mutating reverse proxy guides all over the net thanks to the UK torrent site blockade.
I think it is much simpler. 4 KB pages are small today, and 1 GB is still very large for most processes. 2 MB sounds about right (gut check: how much memory does the average process allocate, and for the small processes, is 2 MB really that much overhead?). Unless the number of TLB entries drops as page size increases, larger pages make sense. It's simple: 1 GB risks wasting memory when the process doesn't need that much. 2 MB is good in 2014.
I think it's a bummer that there isn't some option in between. I could see 64MB pages being really nice.
Ultimately it would be handy to be able to tune page size for the loads that you see. I could see page sizes jumping by 4x or 16x (2 bits or 4 bits) each time being reasonable.
The real issue the author is talking about isn't "how much memory does a process allocate" but rather "how many total pages does the OS have to keep track of and what percentage of those fit in the TLB at any one time?"
If the real issue is simply to minimize the number of pages to track (and thereby maximize TLB hits), then it's very simple: go to 1 GB (or higher!) pages.
This hints that there is a tradeoff happening here. At the highest level, the tradeoff is between having efficient use of memory and TLB hits. Big pages give TLB hits, small pages make efficient use of memory.
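To put rough numbers on that tradeoff (the page counts are exact arithmetic; the TLB entry count is an illustrative figure, not one from the article):

```python
GB = 1024 ** 3

def pages_needed(working_set, page_size):
    # Number of pages the OS (and TLB) must track to map the working set.
    return working_set // page_size

ws = 32 * GB
print(pages_needed(ws, 4 * 1024))       # 8388608 pages at 4 KB
print(pages_needed(ws, 2 * 1024 ** 2))  # 16384 pages at 2 MB
print(pages_needed(ws, 1 * GB))         # 32 pages at 1 GB

# A ~1500-entry TLB (illustrative size) reaches only ~6 MB with 4 KB
# pages, but covers the entire 32 GB working set with 1 GB pages.
print(1500 * 4 * 1024)                  # 6144000 bytes of TLB reach
```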
Since the TLB is in hardware, it is more difficult to have the fine-grained tuning you desire.
PowerPC on Fedora now defaults to 64k pages, I happened to notice the other day, which surprised me. But x86 doesn't have a convenient page size that would work as a default, as 2M is probably a bit big.
Interesting, if one ever needs to boost a memcache/Redis instance, this might actually work.
But what about virtualization? Can I use 1GB pages in a guest OS, or will the host OS still handle everything with 4k pages, nullifying any advantages?
Hardware support for virtualization is actually one of the main things driving huge pages. For a TLB miss in a guest, you end up doing a nested page table walk. This is much more expensive with 4 KB pages (i.e. as opposed to 2 MB).
Short answer: huge pages are a big win for virtualization.
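The cost asymmetry is easy to quantify. With n guest page-table levels and m host levels, a worst-case nested walk touches n·m + n + m memory locations: each of the n guest entries lives at a guest-physical address that itself needs an m-level host walk, plus one final host walk for the data address. A sketch of that standard radix-on-radix formula:

```python
def nested_walk_cost(guest_levels, host_levels):
    """Worst-case memory accesses for a TLB miss under nested paging."""
    # Each guest level: a host walk to locate the entry, plus reading the
    # entry itself. Then one more host walk for the final data address.
    return guest_levels * (host_levels + 1) + host_levels

print(nested_walk_cost(4, 4))  # 24 accesses: 4 KB pages in guest and host
print(nested_walk_cost(3, 3))  # 15 accesses: 2 MB pages in both
```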
Given a background in programming, you can ramp up on this in an afternoon. Here's an intro "crib sheet":
Your application believes that it has all the RAM to itself. This is a lie that the operating system and hardware tell your application, to decouple the physical RAM addresses from the ones your application uses (virtual RAM addresses). Learn more about virtual memory here: http://en.wikipedia.org/wiki/Virtual_memory
In order to keep this mirage working, the computer needs to map from virtual address to physical address. Instead of tracking every single address, it tracks spans of addresses. So, the address your application sees as 0 to 4096 will map to physical address 5000 to 9096. Keeping this map using fixed-size spans keeps the size of the mapping down and the performance fast.
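Because the spans are a fixed power-of-two size, splitting an address into "which span" and "where in the span" is just arithmetic. A small sketch with 4KB spans (the particular page numbers are made up):

```python
PAGE_SIZE = 4096  # 4 KB spans

def split(virtual_addr):
    page_number = virtual_addr // PAGE_SIZE  # which span (looked up in the map)
    offset = virtual_addr % PAGE_SIZE        # position inside the span (kept as-is)
    return page_number, offset

# If the map says virtual page 2 lives at physical page 1000,
# translation is: look up the page, keep the offset.
page, off = split(2 * PAGE_SIZE + 123)
physical = 1000 * PAGE_SIZE + off
print(page, off, physical)  # 2 123 4096123
```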
This article is about using bigger spans (about 1 billion addresses) instead of the standard 4KB. The advantage is that the mapping from virtual to physical is stored in memory as a tree, and bigger spans mean you need fewer nodes in the tree. Fewer nodes means fewer traversals/indirections to find the node you are looking for. Less work means faster performance.
The details about the caching and the TLB entry counts in the processor have to do with how much dedicated space different parts of the CPU have for this mapping information.
The details about offsets, and about changing how the memory was accessed to get positive/negative performance in the 4KB-vs-1GB tradeoff, have to do with whether the mapping information was in the cache or not. It is similar to alignment: http://en.wikipedia.org/wiki/Data_structure_alignment
Finally, in order to use these 1GB mappings instead of 4KB ones, the programmer has to use a special way of allocating memory from the operating system, called mmap: http://en.wikipedia.org/wiki/Mmap
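A minimal sketch of requesting huge pages via anonymous mmap on Linux (the MAP_HUGETLB flag value is the Linux one; the call fails unless huge pages have been reserved beforehand, so this falls back to ordinary 4 KB pages):

```python
import mmap

# Python may not expose this constant; 0x40000 is the Linux MAP_HUGETLB value.
MAP_HUGETLB = getattr(mmap, "MAP_HUGETLB", 0x40000)

def alloc(length):
    """Map `length` bytes, preferring huge pages, falling back to 4 KB pages."""
    base = mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS
    try:
        return mmap.mmap(-1, length, flags=base | MAP_HUGETLB), "huge"
    except OSError:
        # No huge pages reserved on this system: use ordinary pages instead.
        return mmap.mmap(-1, length, flags=base), "normal"

buf, kind = alloc(2 * 1024 * 1024)  # one 2 MB page, or 512 normal ones
buf[0] = 1                          # touching the mapping faults the page in
```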
Not too bad, actually. For a moment I was thinking about how it might be useful to embed large datasets directly in the page, but it wouldn't be even remotely worth the sacrifice in usability. Just make one extra HTTP request and give the user a nice spinny icon.