All Your IOPS Are Belong to Us: Case Study in Performance Optimization (2015) [pdf]

denshi_karasu · on Jan 29, 2018

Hi. I'm the author of that presentation, and I just got a text message from a friend saying that I'm on the front page of Hacker News...

The one thing in that talk that I was never 100% sure of was whether it was block-mq that provided the performance improvement. It wasn't until about a year later that I came across some articles which confirmed that it was actually due to the development of, and subsequent fixes to, the Xen persistent-grants feature.

In other words, ignore slides 15 and 16. =/

robinanil · on Jan 29, 2018

Do you have updates on how the performance is on 4.x kernel versions?

denshi_karasu · on Jan 29, 2018

Yes and no. I have all sorts of test results for the 4.x kernel, but they are for i3 instances rather than i2.*, so they wouldn't be directly comparable. Your question kind of makes me think I should put together an updated version of this talk; I've gathered enough material over the last couple of years that would probably be useful to somebody.

robinanil · on Jan 29, 2018

Yes, that would be useful. 4.x kernels have some block I/O improvements, and some recent Phoronix benchmarks show ext4 making huge strides.

smcleod · on Jan 29, 2018

This was my first thought: kernel 3.x is more than a little dated now, and a huge number of I/O performance and latency-related changes have been incorporated since the 3.x days.

znpy · on Jan 28, 2018

Warning: this is from a 2015 talk, but I still found it interesting as it has shown me a very good way to approach the problems described in the early slides.

random_throw · on Jan 28, 2018

I find this low-level optimization and performance tuning fascinating. Can anyone recommend any good resources to get started with solving these kinds of problems?

not_kurt_godel · on Jan 28, 2018

If only there were an AWS service that existed purely for solving these exact problems so that your most technically talented employees could spend time working on your product instead of dicking around with linux kernel settings.

zytek · on Jan 28, 2018

At the time this presentation was made, AWS did not offer anything that could match tuned MySQL on i2 instances. Aurora was just getting started.

But nowadays? I'm all in for Aurora.

not_kurt_godel · on Jan 29, 2018

According to OP, 800 IOPS was the bottleneck and i2 compute capacity was overkill. RDS offers provisioned IOPS (aka PIOPS) - up to at least 30000 at the time (https://aws.amazon.com/about-aws/whats-new/2014/10/09/amazon...).

brianwawok · on Jan 28, 2018

You mean migrate to GCE?

not_kurt_godel · on Jan 28, 2018

I mean RDS.

brianwawok · on Jan 28, 2018

I think I would rather use GCE

https://thehftguy.com/2016/06/15/gce-vs-aws-in-2016-why-you-...

not_kurt_godel · on Jan 29, 2018

Ok, cool. However, switching to RDS would have been a no-brainer in the OP's particular scenario.

denshi_karasu · on Jan 29, 2018

Not really. The migration of hundreds of terabytes of data from one storage solution to another is never a no-brainer, and it's not just because of technical concerns. RDS is a very good solution for a lot of people, but it's not the right fit for everyone.

not_kurt_godel · on Jan 29, 2018

OP spent multiple months playing around with kernel and MySQL parameters to solve the issue. What happens when they have to do the same thing for the next version of the kernel, or for patches like Meltdown? RDS has entire massive teams dedicated to solving exactly this problem for OP's exact use case.