Hacker News
$1,279-per-hour, 30,000-core cluster built on Amazon EC2 cloud (arstechnica.com)
160 points by duck on Sept 20, 2011 | 35 comments



I wonder what kind of 'bad activity' detection Amazon does, if any? Did they have to call Amazon ahead of time, just to warn them they were about to boot up "3,809" instances? If I tried to do that, would Amazon prevent it?

The reason I ask... how hard would it be to boot up, say 10,000 micro-instances (using a stolen credit card or AWS account) to be used for a DDOS? What do you have to do before red lights start appearing in the AWS NOC?


Amazon limits you to 100 Spot instances per region without contacting them to change the limit for your account.

http://aws.amazon.com/ec2/faqs/#How_many_instances_can_I_run...


That wouldn't prevent someone from creating multiple accounts with stolen credit cards though, right? num_accounts * 100 = DDOS.


That wouldn't prevent it but it would make it much more difficult. Although, I suppose you could automate the whole thing, right from the stealing of the credentials to the booting up of 100 new instances and adding them to the ddos cloud. I wonder if this has ever been done, or if the number of AWS users is just too low to make it worthwhile (like the old argument of why macs don't have viruses)?


They do have limits which you have to ask to remove. For example, you are limited to 5 elastic IPs per region. We had to ask to get that limitation removed and explain why we felt we needed more EIPs.

If I recall, we were also limited to (10?) instances and had to ask for more and explain why.

It took a few days to get approved.


The requirement to explain why you need additional IP addresses likely just flows out of ARIN rules that say ISPs must be able to prove their customers are using addresses efficiently and with justification.


They work pretty actively with Amazon on any of these massive clusters. They call ahead of time and try to find a good time (US night), etc., as I remember hearing.


This is an interesting result: the customer pays slightly less than a million a month ($1,279 x 24 x 30 = $920,880) to run such a cluster. Using 'Westmere'-class processors (2 procs/mobo, 6 cores per proc), that is 2,500 machines; at 400W per machine it's a megawatt of electricity (call it $111K/month at a $0.15/kWh cost, including cooling). It would be interesting to price out the other costs for the machines to understand what sort of revenue that would be.

I wondered about spreading it out around the country, though; it seems like you would incur a lot of latency, which might become a bottleneck.
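For what it's worth, here's that back-of-the-envelope math spelled out as a quick Python sketch. Rough numbers only; the 12 cores/machine, 400W/machine, and $0.15/kWh figures are my own guesses, not anything Amazon publishes:

  # Back-of-the-envelope: monthly revenue vs. power cost for a 30,000-core cluster
  cores = 30000
  cores_per_node = 12        # assumed: dual 6-core 'Westmere' sockets per machine
  watts_per_node = 400       # assumed average draw per machine
  dollars_per_hour = 1279    # peak rate from the article
  dollars_per_kwh = 0.15     # assumed, including cooling overhead
  hours_per_month = 24 * 30

  nodes = cores / cores_per_node                              # 2,500 machines
  revenue = dollars_per_hour * hours_per_month                # $920,880 / month
  power_kw = nodes * watts_per_node / 1000.0                  # ~1,000 kW, i.e. a megawatt
  power_cost = power_kw * hours_per_month * dollars_per_kwh   # ~$108,000 / month
                                                              # (~$111K with a 31-day month)
  print(nodes, revenue, power_kw, power_cost)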


It's pretty well understood in the industry that EC2 is a cash cow for Amazon. I wouldn't be surprised if their operating costs were 10-30% of the hourly rates they charge (based on back-of-the-napkin calculations like you did, but also based on their competitors' rates which are much lower).

However, the cash flow generated by EC2 is likely negative as I believe that (1) all the profits are re-invested in expanding, and that (2) Amazon injects money into it from other sources.


The advantage of an AWS cluster is that you can shut it down when you don't need it, whereas your own cluster needs to be running all the time to justify its comparative cost.


Aye, I agree with that wholeheartedly. It's the whole timeshare market I wonder about. Basically, if there is a market for a cluster of this size at about $1,300 an hour, could you build one of these and rent it out like Amazon does, but more efficiently? (In other words, can you cut costs by specializing in a particularly lucrative segment of the market?)

So if 100% 'occupancy' on your cluster is worth a million a month, and your cost of 'owning' a cluster of this scale is a quarter million a month, then you need 25% occupancy to break even; anything above that and you make money. It is an interesting financial exercise if nothing else.
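Continuing the sketch, with a purely hypothetical quarter-million/month all-in ownership cost:

  # Hypothetical break-even occupancy for an owned cluster rented out by the hour
  revenue_at_full_occupancy = 1000000   # ~$1M/month if the cluster is always rented
  monthly_ownership_cost = 250000       # hypothetical all-in cost of owning/running it

  breakeven = monthly_ownership_cost / float(revenue_at_full_occupancy)
  print(breakeven)   # 0.25 -> 25% occupancy breaks even; anything above that is profit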


The company I work for, R Systems (http://www.rsystemsinc.com/), basically does this. We own and admin several medium-sized compute clusters (200-500 node clusters) which are rented out to customers who need a lot of compute on a temporary basis.

A lot of what we do is heavily custom, and we provide a lot of support for our customers, but we still typically beat Amazon on both price and performance. EC2 might work nicely for embarrassingly parallel workloads, but they don't have Infiniband available if you're latency-bound... :)


Definitely, and agreeing with ajdecon.

There is a lot more to supercomputing than just having lots of machines in a cluster. Special network connections, network topology, routing techniques, specialized CPUs/GPUs, and specialized storage can all make a difference.

I'm sure you could build a special "Amazon for Supercomputing" infrastructure, rent it out to clients, and beat AWS on price/performance. Just add a fast network, a mix of CPUs/GPUs, SSDs, large RAM, and distributed RAM disks. Have some standard network topologies for easy configuration, and some standard cluster layouts for different computing needs. Maybe have the software in place for typical supercomputing needs. The clients just need to provide the data to a cluster and the answer will be spit out.


The total cost of operation would be much greater than hardware investment plus hardware operation (power, space, cooling). Initial setup and ongoing maintenance in particular make a big dent, and high availability is another huge cost driver.

"The cluster, announced publicly this week, was created for an unnamed 'Top 5 Pharma' customer, and ran for about seven hours at the end of July at a peak cost of $1,279 per hour, including the fees to Amazon and Cycle Computing."

Given that the article states that the entire system only ran for about 7 hours, I assume that it was one of those ideal use cases for cloud computing. So the benefit of having a disposable system adds value that is also missing in the napkin calculation. Sure, if you ran the 30k-core cluster for eternity, you might as well build your own data center. But for this case, the comparative cost analysis seems a bit pointless.

One more small nitpick: the $1279 you extrapolated from in your calculation was the peak cost, not the average cost.


Amazon's EC2 technology is quite amazing (to me anyway). We've been using their CloudFormation service, which has allowed us to create our entire server environment (Rails/PHP, MySQL, Mongo, load balancer, Route 53, security groups, alarms, etc.) for different stacks (sandboxes, staging, production) with a few clicks.

"Cycle combines several technologies to ease the process" - Did that include Amazons CloudFormation services?


I doubt it, since CFN (CloudFormatioN) doesn't support spot instances yet.


Chances are that this pharmaceutical company's "embarrassingly parallel" workload could be ported to GPUs to run on anywhere from 1/10th to 1/100th the number of machines.

Porting the application to GPUs may offer a good ROI if the company intends to run this workload often enough.


Chances aren't, actually. GPUs impose many constraints. I engineer embarrassingly parallel workloads for a living, using both CPUs and GPUs, and only a small minority of real codes can be usefully ported to GPUs today. Those that can usually see a ridiculous speedup, but they're the exception, not the rule.

In any case, I know these guys and they're not newbies. They're familiar with GPUs and what they can do. They're running the workload they need to be running.


I seem to be right. The workload was molecular modeling (see the blog post), and many of the algorithms are very well suited to GPUs; see: http://www.ks.uiuc.edu/Research/gpu/

I am not saying they are newbies for failing to exploit GPUs; there could be other reasons. For example, the pharma needed the results ASAP and may not have ported that particular molecular modeling app to GPUs yet. GPGPU is a nascent field, after all.

I write GPU applications too (crypto brute-forcers), just so you know ;)


And I was right. They replied to my comment, saying that workload "still need to be ported [to GPUs]". They just couldn't do it because it was a proprietary app.

http://blog.cyclecomputing.com/2011/09/new-cyclecloud-cluste...


Maybe and maybe not. GPUs are good for problems where the appropriate data can be placed near the vector processor, a priori. There are many embarrassingly parallel problems where memory coherency does not exist, or where the input data sizes exceed what is currently available on GPUs, and for these, the bandwidth to system memory is going to be a severe bottleneck on performance.


Depends on how memory/storage intensive the tasks were: "26.7TB of RAM and 2PB (petabytes) of disk space."

One advantage of going with Amazon is the really high-speed and voluminous ephemeral storage available per instance, in addition to your EBS-backed root volume.


Amazon has GPU-based instances available, giving the customer the combination of GPU processing and large RAM/backing-store resources.


They sounded completely CPU-bound from the blog post. They went with "high-CPU" c1.xlarge instances that had modest RAM and storage specs. They gave no details and expressed no concerns whatsoever about RAM or storage bottlenecks.

http://blog.cyclecomputing.com/2011/09/new-cyclecloud-cluste...


One way to find out - I posted a comment on that blog. We'll see if we get an answer.


The stated application, molecular modeling, often requires double-precision floating point, for which GPUs offer less of an advantage than for single-precision and integer operations.


Double precision used to be a problem for GPUs, but not anymore with the latest GPU microarchitectures.

  - A single AMD HD 6990 is capable of 1275 DP GFLOPS
  - A single Nvidia Tesla 20xx is capable of 515 DP GFLOPS
  - A 6-core CPU at 3GHz with SSE is capable of only 72 DP GFLOPS
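Those figures fall out of the usual peak-throughput formula: units x clock x FLOPs per unit per cycle. A quick sketch; the shader counts, clocks, and per-cycle rates below are my assumptions:

  def peak_dp_gflops(units, clock_ghz, dp_flops_per_unit_per_cycle):
      # Theoretical peak double-precision throughput, in GFLOPS
      return units * clock_ghz * dp_flops_per_unit_per_cycle

  # Assumed HD 6990: 3072 stream processors at 0.83 GHz, DP at 1/4 the SP FMA rate
  print(peak_dp_gflops(3072, 0.83, 0.5))   # ~1275 GFLOPS
  # Assumed Tesla C2050: 448 cores at 1.15 GHz, DP at 1/2 the SP FMA rate
  print(peak_dp_gflops(448, 1.15, 1.0))    # ~515 GFLOPS
  # 6-core CPU at 3 GHz, SSE retiring a 2-wide DP add plus a 2-wide DP mul per cycle
  print(peak_dp_gflops(6, 3.0, 4.0))       # 72 GFLOPS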


If they have the headroom to spin up 30,000 cores, I wonder what their total headroom is in each availability zone, and what percentage that is of their total cores.

Sure, we'll never know - but it's an interesting thought experiment.


What I find more amazing than anything else is that Amazon apparently has 30,000 spare cores that one can ask for when needed.


Probably not spare; probably just capacity that was running spot instances at a lower bid price. You would assume that these guys paid a bit of a premium to jump over everyone else.
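That's roughly how the spot market behaves: available capacity goes to the highest bidders, and lower bids get outbid when supply tightens. A toy illustration in Python, obviously nothing to do with Amazon's real allocation code:

  # Toy spot-market allocation: highest bids win the available capacity
  capacity = 30000                          # cores up for grabs
  bids = [(0.06, 20000), (0.08, 15000),     # (bid in $/core-hour, cores requested)
          (0.30, 30000)]                    # one high bid "jumps over everyone else"

  allocated = []
  for price, cores in sorted(bids, reverse=True):   # best bids served first
      take = min(cores, capacity)
      if take:
          allocated.append((price, take))
      capacity -= take

  print(allocated)   # [(0.3, 30000)]: the high bidder gets everything, the rest wait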


I worked on a small-scale cluster at a pharma company 6 years ago. The scientists' goal was to discover new compounds to patent without having to do the screening by hand. The cluster would screen thousands of compounds simultaneously all day long. It was very interesting work.


I'd be curious to see how much $$$ in terms of developer hours it took to set this up.


I know the guys who did it. It took an enormous amount of hard-won experience from similar but somewhat smaller spinups; experience that would be very expensive to replicate with a fresh team. Given that experience, however, it took them relatively few developer-hours to accomplish.


I find this extremely interesting. Does anyone have any experience or strong interest in this sort of work? I am looking to one day use many cores like this, although at a smaller scale.


So when your latency goes from 150ms to 300ms, you'll know who to blame.



