This is one of the key reasons I recently started my new project, Uptano (shameless plug: https://uptano.com); all its servers use dedicated RAID 1 (two drives) with 10K RPM or SSD storage.
The same issue applies to network performance. I've seen very expensive EC2 instances that couldn't even push 50 Mbit/s to the net, while other instances of the same type could at least do a few hundred Mbit/s. AWS' answer was always to simply buy even more expensive instances, so fewer people are sharing, but that's a terribly costly answer.
I'm doing bonded (802.3ad) 2x1 Gbit/s connections on all servers, because that's what I wish EC2 had.
Multiple customers with highly varied workloads sharing the same physical server hardware is simply a fundamentally flawed idea. IMHO, it only makes sense to use a VPS for very small personal projects, where you can't justify ~$140/mo in server costs.
EC2 was a really novel thing and it brought lots of great technology to the scene, but they made a few fundamentally wrong choices.
We've had similar issues in the past, such as when we used RDS for some of our projects (in particular, issues with disk IO). AWS is a great place to start, but chances are the "one size fits all" solution is going to get difficult once you're doing more advanced work.
I’ve been working on some similar analysis with EC2 and other providers. I think the big missing point in this post (acknowledged by the author) is with regard to EBS-optimized instances and provisioned IOPS, where we’ve observed a dramatic improvement in IO consistency. Another interesting thing I've observed is that performance consistency often declines with multi-EBS-volume RAID, likely due to variations in spindle tenancy or network latency. EBS test volumes were 300GB. Better performance/consistency is possible with larger EBS volume sizes.
Here are links to a couple summaries of the analysis I’ve done on EC2, Rackspace and HP. I plan on writing a blog post regarding this analysis soon.
Disk Performance:
The value column is a percentage relative to a baremetal baseline 15k SAS drive, where 100% signifies comparable performance. Benchmarks included in this measurement are fio (4k random read/write/rw; 1m sequential read/write/rw), fio – Intel IOMeter pattern, CompileBench, Postmark, TioBench and AIO stress:
Disk IO Consistency:
The value column is a percentage relative to the same baseline. A value less than 100 represents better IO consistency than the baseline. The value is calculated by running multiple tests on an instance, measuring the standard deviation of IOPS between tests, and comparing those standard deviations to the baseline. Testing was conducted over a period of a month on multiple instances in different AZs.
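To make the metric concrete, here's a minimal sketch of that calculation. The function name and the sample IOPS numbers are mine, purely for illustration, not from the actual dataset:

```python
import statistics

def io_consistency_pct(instance_iops, baseline_iops):
    """Consistency score: the stdev of an instance's IOPS across
    repeated runs, as a percentage of the baseline's stdev.
    Values below 100 mean the instance was MORE consistent than
    the bare-metal baseline; above 100, less consistent."""
    return 100.0 * statistics.stdev(instance_iops) / statistics.stdev(baseline_iops)

# Hypothetical numbers: the baseline 15k SAS drive varies a little
# between runs; a noisy cloud instance varies a lot.
baseline = [4000, 4100, 3950, 4050, 4020]
instance = [3000, 5200, 1800, 4400, 2600]

print(io_consistency_pct(instance, baseline))  # well above 100
```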
OP here. We'd like this work to be a useful resource - everyone benefits when there's more / better information about how these complex cloud systems perform in real life. So please comment with suggestions, questions, or any other feedback!
I wish you tested with larger numbers of EBS volumes. I've just started researching EC2 and EBS, and the links I saw seemed to indicate that large numbers of EBS volumes with raid0 smooths out the performance (8,16,24 volumes). http://orion.heroku.com/past/2009/7/29/io_performance_on_ebs...
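To see why striping over more volumes might smooth things out, here's a toy Monte Carlo sketch (all numbers are assumed, not measurements): if each volume's throughput fluctuates independently, the relative spread of the summed stripe shrinks roughly as 1/sqrt(N). Real RAID-0 over EBS is messier (synchronous IO can be gated by the slowest volume), so treat this as a back-of-the-envelope model only:

```python
import random
import statistics

random.seed(0)

def aggregate_samples(n_volumes, trials=2000):
    """Model each volume's throughput as an independent noisy value
    (mean 100 MB/s, sd 40, both made up); a RAID-0 stripe under a
    parallel workload sees roughly the sum across volumes."""
    return [sum(random.gauss(100, 40) for _ in range(n_volumes))
            for _ in range(trials)]

for n in (1, 8, 16, 24):
    s = aggregate_samples(n)
    cv = statistics.stdev(s) / statistics.mean(s)  # relative spread
    print(f"{n:2d} volumes: relative stdev {cv:.1%}")
```

The relative stdev printed for 24 volumes comes out far smaller than for 1, which matches the "smoothing" the linked Heroku post describes.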
The graphs are very well done. That amount of data would have been incomprehensible if not for your carefully thought out charts.
Thanks. I did put in quite a bit of work in finding the best ways to present the data, which depending on how you count is around five- or six-dimensional.
I'll see whether we can work larger RAID configurations into a follow-up.
I found it interesting, but have to admit I scanned through to read about provisioned IOPs almost right away and was disappointed that there weren't any benchmarks.
Yes -- unfortunately, Amazon announced provisioned IOPs approximately the day after we finished the benchmarking work. It took a while to find time to write up the results.
We'll very likely do a followup post to test provisioned IOPs. If anyone has other suggestions to include in the followup work, please let us know.
"EBS Optimized" instances dedicate an extra 1 GigE interface just for EBS IO. If your application causes contention between it's service and EBS IO, "EBS Optimized" instances may be for you. Unfortunately, I haven't seen any benchmarking on it yet.
I wish this article had some comparison of AWS I/O (both ephemeral and EBS) to the real hardware (some average HDD and SSD, maybe RAID-1) in terms of random R/W throughput, sequential R/W throughput, single thread random IOPS, multithread random IOPS.
I'd love to see a similar set of benchmarks (or perhaps the same one) run on other cloud providers like Rackspace, Linode, etc. Any chance of seeing the nitty gritty details up on Github?
Great article; I got all the way through it. It matches what we saw: ephemeral storage on large and xlarge instances is almost always the way to go. One thing missing from the article is a good comparison of $/iops and $/iops/storage between the instance types.
$/iops between storage types is the subject of the "Cost Effectiveness" charts near the top of the post, but I suspect I'm not quite catching your meaning. What do you mean by $/iops/storage?
Enjoyed the article, but it feels very incomplete without discussion of the solid-state, provisioned iops, and ebs optimization options. Would be interesting to see if those get rid of the bad apples and what sort of benefit they give.
One question: on the throughput graphs, I understand why you normalized them per graph, but were there any differences between graphs (particularly in terms of EBS vs. ephemeral) that would be sufficient to drown out the variability within the throughput graphs?
Yes, the differences between graphs were fairly substantial. You can get a sense of the EBS vs. ephemeral difference by looking at the "Throughput by Threadcount" graphs (about 1/3 of the way through the post). Briefly, EBS has much higher throughput for random writes; ephemeral (instance storage) is faster for reads and bulk writes.
Hmm, it shows "4-EBS RAID" getting around a 2.6x speedup for 4k writes. They don't say what RAID configuration they are using, but it sounds odd. A 4k write has to be synced to all the disks unless they have a <=4k stripe size AND are using RAID-0 AND are using stripe-aligned IO ops. It's also possible they use 4k writes to a cache that end up forming large dirty blocks which the OS then syncs as larger I/Os. But that would be measuring something other than what the benchmark claims.
Correcting myself: they do say in the methodology section that they are using RAID 0, and that the write benchmark is multithreaded. So that explains it: it's measuring write throughput for 4k writes.
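For anyone puzzled by why multithreading resolves this: in RAID-0, each stripe-sized chunk of the address space lands on a single disk, so an aligned 4k write touches only one disk and concurrent writers can hit all disks at once. A minimal sketch of that mapping (stripe size and disk count are assumptions, not the article's actual configuration):

```python
def raid0_target(offset, stripe_size=65536, n_disks=4):
    """Which disk a byte offset lands on in a simple RAID-0 layout:
    stripes are assigned to disks round-robin."""
    stripe = offset // stripe_size
    return stripe % n_disks

# With a 64 KiB stripe, an aligned 4 KiB write touches exactly one
# disk, so several concurrent 4 KiB writers spread across all four
# disks -- which is how a multithreaded benchmark sees a speedup.
writes = [raid0_target(i * 65536) for i in range(8)]
print(writes)  # cycles through disks 0, 1, 2, 3, 0, 1, 2, 3
```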
What size were the EBS volumes? Would be interesting how important the volume size is to performance.
If you only provision 1TB (or larger, if they have them available now) EBS volumes, then you'd have spindles dedicated to you, whereas with smaller ones there might be a lot more variation because you share.
I wish the OP had at least included some "conclusion/final words".
For someone in the same boat of choosing the right platform, this benchmark is no real help in deciding which to go with. A comparison with Rackspace Cloud, for one, would be very helpful.
I think the conclusion is that you need to provision a lot of machines, and benchmark your own workload thoroughly to assess any cloud provider. That way you'll have solid performance numbers that are relevant to your case. They (rightly, in my mind), don't come down one way or the other on particular instances or options, because the cost/value analysis depends so specifically on your workload. If you know your workload very well already, they give enough detail that you could reuse some of their data.