Comparing Cloud Compute Services (cloudharmony.com)
70 points by jread on July 8, 2014 | 33 comments



This is terrific work. I agree with other posters that the graphs and supporting information could be improved, but underneath the presentation of the results, you've done a VERY good job avoiding the pitfalls most comparisons suffer. As mentioned in your intro and conclusion, this was clearly one of your goals. Nailed it. :-)


I found the results very interesting, but I'm not a big fan of the charts. The x-axis is categorical, not continuous, so a line graph isn't appropriate here.


> so a line graph isn't appropriate here.

Not just inappropriate, but plainly incorrect. The charts as they stand imply that the user can pick an offering between "small server" and "medium server" and get performance along the interpolated line?! Use scatterplots.
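To illustrate, a minimal matplotlib sketch of the same point - plotting instance sizes as categories rather than a connected line (the sizes and scores below are made up, not the article's data):

    # Categorical x-axis: grouped bars instead of a connected line.
    # Sizes and scores are invented for illustration only.
    import matplotlib.pyplot as plt

    sizes = ["small", "medium", "large", "xlarge"]   # categories, not a continuous axis
    provider_a = [10, 19, 37, 71]                    # hypothetical benchmark scores
    provider_b = [12, 22, 41, 80]

    x = range(len(sizes))
    width = 0.35
    plt.bar([i - width / 2 for i in x], provider_a, width, label="Provider A")
    plt.bar([i + width / 2 for i in x], provider_b, width, label="Provider B")
    plt.xticks(list(x), sizes)
    plt.ylabel("Benchmark score")
    plt.legend()
    plt.show()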


Point taken - graphs will be improved next time.


My colour blindness is relatively mild, but I struggled with many of the charts. I'd suggest putting the labels on the actual lines, either within the chart area or to one side, next to where each line terminates.

It sounds trivial, but with 20+ charts that I had to stick my face up to the screen for, I gave up after about 5.

Regarding "Due to SPEC rules governing test repeatability (which cannot be guaranteed in virtualized, multi-tenant cloud environments), the results below should be taken as estimates.". I'd like to have seen some attempt to migate this with, say, some kind of averaging of x tests over different instances. Although, I understand this would have extended an already long test process.


The SPEC CPU 2006 estimate comment is a legal requirement for any public results that cannot be guaranteed to be repeatable (which will never be the case in a virtualized, multi-tenant cloud environment). The results provided are based on the mean from multiple iterations on multiple instances from each service. If you were to run it on these same instances your results should be similar. The post also provides standard deviation analysis for all SPEC CPU runs on a given instance type to demonstrate the type of results variability you might expect.


I'm really impressed with the amount of effort that has gone into this study, but those charts are just awful. Those line charts should be bar charts IMHO; I immediately scan the chart left-to-right before realising that the x-axis values are actually categories.


DigitalOcean got burned pretty badly in the network performance and latency benchmark, not to mention disk I/O. DO also came in last for availability. Amazon seems to come out on top in most categories (including price/performance with T2), except in a few cases where Rackspace did really well for database I/O and SoftLayer did really well in large database random read/write throughput.


High-quality content. You should repeat the download link at the bottom of the page, after the reader has been impressed by the data and may want more, but will definitely have forgotten this:

"This post is essentially a summary of a report we've published in tandem. The report covers the same topics, but in more detail. You can download the report for free."


Good feedback, thanks - just added a link to the end.


Very detailed results. It's interesting how poorly DigitalOcean did for consistency of IO. Amazon, Rackspace and SoftLayer seemed to fare best in many categories, ahead of Azure, GCE, DigitalOcean, etc. GCE database IO performance seemed particularly poor. Amazon seems to be far ahead of the rest for internal network throughput and latency.


GCE no longer offers a local storage option, which is one reason for the slower IO. They also cap IOPS for better IO consistency.


It's a shame Linode wasn't included in this now that they too have per-hour pricing.


Linode will be in the next report. We chose the services and began working on this report before Linode announced their infrastructure upgrades.


Agreed, especially since DO & Linode are fighting over the same market space.


I was disappointed they didn't include internal network latency variability like they did with disk performance. I've seen EC2 have wildly different network latencies at times, but haven't tried any of the other services.


Mean latency RSD was EC2: 8.9%; Azure: 10.3%; DigitalOcean: 23.7%; GCE: 8.9%; Rackspace: 6.1%; SoftLayer: 16.8%.
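For anyone unfamiliar, RSD is relative standard deviation - the standard deviation expressed as a percentage of the mean. A quick sketch of the calculation with made-up latency samples:

    # Relative standard deviation (RSD) = stddev / mean * 100%.
    # The latency samples below are invented for illustration.
    import statistics

    latencies_ms = [0.42, 0.45, 0.40, 0.48, 0.43]   # hypothetical RTTs from repeated runs
    mean = statistics.mean(latencies_ms)
    rsd = statistics.stdev(latencies_ms) / mean * 100
    print(f"mean={mean:.3f} ms, RSD={rsd:.1f}%")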


Good to know, thanks for the reply. Does this seem to change with instance type?


Not really for latency - throughput is more variable between instance types, particularly for services like Rackspace that have instance-specific limits.


I am glad that the colors were consistent between the graphs so I spent less time referring to a legend.

If reports like this become regular (say, a monthly occurrence), would it be possible or feasible for the cloud providers to try to game (or optimize for) certain qualities?


It is possible - but it would be difficult to optimize for every performance characteristic because they are derived from many different benchmarks. This was a problem in the TPC-C era, with database vendors optimizing specifically for a single benchmark.


Does that have anything to do with the DeWitt clause? (which, last I looked, is apparently also in the Datomic license)


What observability tools were used to confirm that the target of the test was actually being tested properly?

I've performed, and also debugged, a lot of these cloud comparison benchmarks, and it's very, very easy to have bogus results due to some misconfiguration.


As an example of some detail that is lacking:

What is the total file size for the fio runs? (And, what is the intended working set size?) Was fio configured to bypass the file system cache and perform I/O directly to disk? (And if so, what is the rationale for bypassing it?[1]) Was iostat or other tools run during the benchmark to confirm that fio was configured and operating correctly, and that the results could be trusted? Was the same version of fio used, and built with the same compiler (same binary?).

The 118 page report does not include the actual fio commands used.

[1] Disk I/O latency can be a serious issue in cloud environments, and one that vendors can address by incorporating additional levels of storage I/O cache (eg, in the hypervisor). Picking benchmarks that bypass the cache discourages vendors from doing this, which is ultimately not good for our industry.


The intent of the fio testing was to measure block storage, not cache I/O. Direct I/O was set in the fio configs, but is usually not honored by the hypervisor. During run-up, a 100% fill test was performed using refill_buffers and scramble_buffers to break out of cache. Then optimal iodepth settings were determined for each workload and block size by running short tests with incrementing iodepth settings (targeting maximum IOPS). Once iodepth was determined, 3 iterations of tests were performed, each with 36 workloads (18 block sizes, random + sequential). Each of these ran for 15 minutes (5 minute ramp_time, 10 minute runtime). Since asynchronous IO and variable iodepth settings were used, latency wasn't compared. Total test time per instance for run-up and 3 iterations was about 36 hours. fio configs are available here (iodepth and device designation are added at runtime):

https://github.com/cloudharmony/fio/tree/master/workloads
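For anyone who doesn't want to dig through the repo, a rough sketch of what launching one such workload might look like from Python - the device path, block size and iodepth here are placeholders, not the exact values used in the report:

    # Sketch of a single fio workload along the lines described above:
    # direct I/O, 5 minute ramp, 10 minute measured run, refill/scramble buffers.
    # Device path, block size and iodepth are placeholders.
    import subprocess

    cmd = [
        "fio",
        "--name=randread-4k",
        "--filename=/dev/xvdb",      # placeholder block device
        "--ioengine=libaio",
        "--direct=1",                # request direct I/O (may not be honored by the hypervisor)
        "--rw=randread",
        "--bs=4k",
        "--iodepth=32",              # in the report this was tuned per workload during run-up
        "--refill_buffers",
        "--scramble_buffers=1",
        "--time_based",
        "--ramp_time=300",
        "--runtime=600",
    ]
    subprocess.run(cmd, check=True)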


Thanks, but I disagree with the approach of only showing storage benchmarks with caches disabled. Production workloads will encounter variance between providers thanks to their different caches and different handling of direct I/O. I'd include direct I/O results _with_ cached results, so that I wasn't misleading my customers.

I know what I'm suggesting is not the current norm for cloud evaluations. And I believe the current norm is wrong.

The more important question is how the benchmarks were analyzed -- what other tools were run to confirm that they measured what they were supposed to?


Good point - user experience may include cached and non-cached I/O, so it would be beneficial to include both in this type of analysis.

The benchmark binaries, configurations and runtime settings were generally consistent for instance types of the same size across services, but we didn't verify the efficacy of the benchmarks as they ran.


wow, I was expecting to hear "We couldn't do much to remove the caches from the equation", but it looks like you guys put a lot of work into doing what you could.

Excellent work, thanks for doing a quality job on this.


There's also a thing called ServerBear; it's like server benchmark ratings, but run by users.


See also the continuously updated benchmarks at cloudlook.com


The problem with benchmarks is that it's really, really difficult to emulate real-world conditions. However, here are some of the more obvious points that are unrealistic.

1) Comparing between the same number of cores. The core count selected for each testing level is completely arbitrary. With both web and database servers, which scale well to increasing core counts, single-threaded performance is generally less of a concern and should not be a point of measure aside from average page load time. Some server configurations are optimized for a higher number of slower cores, while others are optimized for fewer but faster cores. By comparing like core counts, this testing is highly skewed toward the latter.

Comparing packages at the same price point, or by how the package fits into the product lineup (smallest, median, largest instance), would be much fairer. If 4 cores at one provider cost the same as 1 core at another, it should be fair to compare the two at different core counts.

2) Server configurations. For both web and database servers, the best performance optimization that can be done is to cache to RAM as much as possible. With increased caching, the need for disk I/O also goes down significantly, easily by as much as an order of magnitude. Serving static content uses minimal resources and is mostly dependent on network performance. Dynamic content is more CPU intensive, and most of the time you can and should be caching the compiled opcode/bytecode. Most website database usage is read heavy, and many of the queries can be cached as well. The one drawback to a heavy emphasis on caching is that if the server restarts, there may not be enough resources to service all requests while warming up the cache. However, given that dynamic loads are precisely what cloud offerings are supposed to excel at, you can spin up additional instances at these times, or just take a horizontally scaled approach to begin with so that a single instance failing will not have a major impact on your aggregate load.

3) Synthetic benchmarks, by their very nature, do a poor job of emulating real-world performance. The best way to benchmark both a web server and a database is to take a real site, log all the requests, and replay the logs. What you want to measure is the maximum number of requests or queries that can be served, the average time and standard deviation at different request/query rates, etc.

4) Network speed tests. The biggest mistake that most tests make is that they measure performance from content network to content network, rather than from content network to eyeball network. Especially with the current peering issues going on between carriers and eyeballs, this is more important than ever. This is a very difficult problem to solve, however, as it's not easy to do throughput tests from a large number of different eyeball networks. You would have to take a very large number of client-generated results and compare differences for all the different providers in all their different locations, which would be nearly impossible. The next best thing, while still a lot of work but more feasible, is to collect IPs in eyeball networks for as many different locations as possible, or perhaps just the top X cities by population, and run continuous pings/traceroutes over an extended period of time. You can then use average latency, standard deviation, and packet loss % as the metrics.
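A crude sketch of that kind of probe, using the system ping against a couple of placeholder IPs (everything below is illustrative, not a real measurement setup):

    # Ping each target a few times and report mean RTT, stddev and packet loss.
    # The target IPs are documentation placeholders, not real eyeball networks.
    import re
    import statistics
    import subprocess

    targets = ["198.51.100.10", "203.0.113.25"]   # placeholder eyeball-network IPs
    count = 10

    for ip in targets:
        out = subprocess.run(["ping", "-c", str(count), ip],
                             capture_output=True, text=True).stdout
        rtts = [float(m) for m in re.findall(r"time=([\d.]+)", out)]
        loss = 100 - 100 * len(rtts) / count
        if rtts:
            print(f"{ip}: mean={statistics.mean(rtts):.1f} ms "
                  f"stdev={statistics.pstdev(rtts):.1f} ms loss={loss:.0f}%")
        else:
            print(f"{ip}: unreachable (loss=100%)")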


Appreciate the constructive feedback. Just a few points of response:

1a) Core selections were not arbitrary - they covered most compute instance sizes offered by each service: 2, 4, 8 and 16 cores

1b) Testing was not skewed by faster cores. CPU performance analysis was based on SPEC CPU 2006 using base rate runtime. SPEC CPU is a multi-core benchmark with metrics that scale linearly with the number of CPU cores.

1c) The value analysis (based on CPU performance and price) highlights differences in pricing between services. I believe matching CPU cores and deriving value is preferable to comparing compute instances based on price.

2) The intent of the post was to provide and compare multiple relevant performance characteristics for these types of workloads: CPU, memory, (non-cached) storage IO, network. Each of these characteristics is analyzed separately using relevant benchmarks and runtime settings. If your workload relies more heavily on cached IO, then more emphasis could be placed on the memory performance analysis.

3) There are hundreds of cloud services. High level analysis like that provided in this post can at least get users pointed in the right direction.

4) Web server network performance analysis is based on "eyeball" or last mile testing from our cloud speedtest http://cloudharmony.com/speedtest. As stated in the post, this is the most complex performance characteristic to measure. Internal network performance is also included in the DB server analysis.


> 1b) Testing was not skewed by faster cores. CPU performance analysis was based on SPEC CPU 2006 using base rate runtime. SPEC CPU is a multi-core benchmark with metrics that scale linearly with the number of CPU cores.

I wasn't suggesting that the testing was skewed by faster cores, but that your criteria favour hardware configurations optimized for fewer, faster cores because of your comparisons between packages of the same core count.

> 1c) The value analysis (based on CPU performance and price) highlights differences in pricing between services. I believe matching CPU cores and deriving value is preferable to comparing compute instances based on price.

We'll have to agree to disagree on this point. It's been my experience that in real-world situations, when comparing options, nobody is going to reject higher core counts at the same price. Generally, a decision maker looks for the lowest price that meets all requirements, or the best option that still fits within budget. They're not going to simply say that, since one provider offers a package with a particular core count, they'll only compare against other providers' options with the exact same core count regardless of price. That's what I mean by the core selections being arbitrary.

For example, an E5 2643v2 6x3.5GHz CPU costs about the same as an E5 2670v2 10x2.5GHz CPU. The 2670v2 offers approximately 25% more aggregate performance, and is clearly the better option except for the unusual cases where single-threaded performance is more important than aggregate performance. However, given how you choose to compare different packages with the same core counts, infrastructure using the 2643v2 would be favoured in your testing. The decision to compare a single 2643v2 core vs a single 2670v2 core would be completely arbitrary.

> 2) The intent of the post was to provide and compare multiple relevant performance characteristics for these types of workloads: CPU, memory, (non-cached) storage IO, network. Each of these characteristics is analyzed separately using relevant benchmarks and runtime settings. If your workload relies more heavily on cached IO, then more emphasis could be placed on the memory performance analysis.

My point is, the vast majority of web and database servers either rely heavily on cached IO, or should. SSDs should not be necessary for web servers; you will be able to serve a higher number of requests with the same budget by using a larger amount of RAM instead. Likewise, you should fit your entire database into RAM if you can, and only resort to SSDs when you can't. And this isn't limited to I/O; you can reduce the amount of CPU resources required by caching to RAM as well. As such, comparing packages based on the amount of RAM included is going to be more meaningful than comparing them based on the number of cores.



