Data Mining 3.4 billion Web pages for $100 of EC2

chrisacky · on Oct 17, 2012

I'm a large user of spot requests for my main application stack. I run my startups core services on spot requests. These are processes that I need running 100% of the time (memcache, nfs, gearman, varnish, nginx etc).

I'm sure you have all heard of the Chaos Monkey that Netflix runs? Well, I didn't actually even need to code a chaos monkey... All you have to do is run everything on spot requests. Eventually you will lose servers at unpredictable times because someone outbids you[1].

The typical spot request pricing is at least 3/10th of the price of an on demand. For instance a cp1.medium (4 cores 4GB RAM) costs $0.044 per Hour for a spot request. Compare that to on demand and it is $0.186 per Hour.

I bid $1 per hour for my spot requests across two zones in the same region. I group my servers and use ELB (Elastic Load Balancers) to route requests...

Typically, a spot request might last for about a week before it gets killed because the capacity isn't there. That's then when instances in my other zone takes 100% of the load temporarily. At this point, since I've lost an entire zones worth of servers, I have my auto scaling group fire up on demand instances until I can get some more spot requests fulfilled. Creating a setup like this took about a week, but the savings are enormous.

----

How is the data stored on this setup?

RDS(MySQL) handles all data that can be stored in a database.

Ephemeral Storage is used to store things that don't need to be persistant (ie transactional logs).

Sessions are managed through Redis.. If the redis servers die, then session handling is handled via mysql temporarily. (It's a lot slower but the mysql server is RDS so it's always running)

Elastic Block Storage volumes are automatically mounted to a single instance which is then set up as a NFS server->client in order to allow other servers to read from a particular mount point (ie.. A user uploads an image, and it's stored on the NFS mount. A different server reads the file and starts generating different dimensions, uploads all of the files to Amazon S3, and then deletes the original file on the NFS device).

The worst part about losing servers is when the memcached server dies, because I could lose weeks worth of cache storage. When this happens, I have to boot up several micro instances that take my "cache warming" list and basically start repopulating memcached again.

The entire system is designed to be redundant... I can kill every server and then run the initialization script to start up the entire stack. (It's basically lots of little cloud-init scripts[2])

[1] http://chrisacky.com/images/lulz.png

[2] https://help.ubuntu.com/community/CloudInit

tszming · on Oct 17, 2012

Thanks for your detail explanation.

I haven't used spot instance before, I am curious how you handle the termination gracefully? i.e. When the spot instance get terminated during the middle of transactions, e.g. uploading large file, writing to DB etc.

garindra · on Oct 17, 2012

I would guess the same way you do when regular servers terminate/die for whatever reason.

raphinou · on Oct 18, 2012

The thing is, this event is usually uncommon, whereas in this architecture it is quite common.

raverbashing · on Oct 17, 2012

Very nice!

How do spot servers work (or better, how are they shutdown)? Can you subscribe to a notification prior to shutdown or it's like someone ran a shutdown -h on it?

Do you have redis saving to EBS then?

chrisacky · on Oct 17, 2012

The shutdown process is immediate. Everything just gets terminated. Redis flushes to EBS every so often. If the server running redis did die, then there would just be a "service unavailability" on services that depend on Redis. It's no major biggy... My nagios server would spot that the server died, and would fire off a new request with the preconfigured cloud-init script that tells the server what to install and then to mount the very same EBS volume to the server again. (You can do this using the ec2-* API command tools). I decided to use cloud-init scripts rather than Chef/Puppet.

Regarding uploads of images, because users uploads are saved directly to the "uploads" EBS volume which is shared using NFS, even if the instance is in the middle of an upload to S3, I will know that it failed because I could do a simple query to check my database to see when a job was started, and when it was finished. Whenever my files are "syncronised" to S3, I then set the file in the database to "s3Synced=true". (The reason why I don't upload directly to S3 using the new signature/CORS feature is that I have to generated 6 different dimensions of each uploaded image and then also watermark)

If the server running the single EBS dies (this is a single point of failure for me, but could easily be avoided if I wanted to run something like gluster or even two EBS volumes on different servers (which I don't )), then file uploads are suspended temporarily. It's a bit pointless accepting uploads because although I could store them on each server that received the request, the batch processing of resizing them and watermarkering is performed by gearman workers which run on a small cluster of a handful of micro servers.

If a server is in the middle of a transaction and dies, well then you're kinda screwed. I'd suggest that if you know that certain traffic must not drop for any reason, such as payment gateways etc, then you should route traffic through a reverse proxy to an on demand/reserved instance. You can do this in nginx effortlessly.

I'll write a blog post about the entire stack next month if people are genuinely interested. Jeff Barr just dropped me an email actually (Hi Jeff, if you are reading this comment again, saying he would be interested in a full write-up).

chaz · on Oct 17, 2012

Thanks for outlining a lot of details here, but a full writeup would also be great.

jeffbarr · on Oct 17, 2012

Hi Chris, looking forward to more details.

moe · on Oct 17, 2012

That's quite an interesting architecture.

Do you have data on how frequently your spot instances terminate gracefully (with time to finish their requests) versus how often they zap out?

Zombieball · on Oct 17, 2012

I'd definitely be interested in knowing what application you are speaking about (if you care to share)?

Have you found typical spot instance pricing to be significantly cheaper than reserved instances?

sandGorgon · on Oct 17, 2012

Why did you choose Redis instead of memcached ? In your particular setup, arent you concerned about the few little extras of memcached like locking, etc. ?

aristidb · on Oct 17, 2012

Does the actual data also reside on these instances?

EwanToo · on Oct 17, 2012

You definitely need to write this up fully!

wmf · on Oct 16, 2012

Bit of a problem with the headline: they didn't crawl anything because Common Crawl already did that.

chime · on Oct 16, 2012

Data Mining != Crawling. I don't see a problem with that.

Steko · on Oct 17, 2012

Submitted Title used to say "Crawled"

sjg007 · on Oct 17, 2012

Is grepped a better choice? You can crawl in memory from a repository, or "crawl" across the net.

sadga · on Oct 17, 2012

AWS Spot Instances are incredible. They make you a liquidity provider, and reward you for it.

Paying list price for any load that isn't mission-critical and needed immediately is insane.

chaz · on Oct 17, 2012

Also, if you're looking for a longer-term commitment but want to get out of it, their Reserved Instances can now be bought and sold. Great for if your plans change and you want to recoup some of your costs.

http://aws.amazon.com/ec2/reserved-instances/marketplace/

baruch · on Oct 17, 2012

If anyone is interested in spot instance pricing there was an interesting work on the subject: http://www.cs.technion.ac.il/~ladypine/spotprice-ieee.pdf

In short, it's not a perfect supply-and-demand market, but it's interesting to see the details they found.

cpenner461 · on Oct 17, 2012

Anyone have any experience/thoughts on using spot instances with Hadoop? Specifically, regular instances with Hadoop installed (not via Elastic Map Reduce). The cost savings are potentially huge, but I'd hate to lose my instances 80-90% of the way through a set of long running (12-48h) jobs. I guess if I had EBS backed instances I could relaunch and resume, but I'm not sure how well that'd work in practice.

xtrahotsauce · on Oct 17, 2012

We bring up spot instances all the time with EMR as additional compute capacity, and we're okay with these instances possibly going away because we use HDFS very little. Instead, we store almost everything on S3.

Evbn · on Oct 17, 2012

Why are your jobs 12-48hr? Allocate more worker hosts during job runs.

wanghq · on Oct 17, 2012

"Master Data Collection Service. A very simple REST service accepts GET and POST requests with key/value pairs. ... we then front end the service with Passenger Fusion and Apache httpd. This service requires great attention, as it’s the likeliest bottleneck in the whole architecture." Seems this can be replaced by DynamoDB.

luckyoyster · on Oct 18, 2012

Thanks for all the commentary. We're planning on presenting this work at reInvent with the folks from Common Crawl, and also releasing sample code to github. For those who haven't yet tried spot instances, or looked into the Common Crawl data set, we highly recommend them!

amalag · on Oct 17, 2012

What about using elastic map reduce with spot instances instead of a custom job queue. Hadoop seems to do this for us and supports the arc format as an inputformat.

jeffbarr · on Oct 17, 2012

You can do that with ease. Here's my blog post:

http://aws.typepad.com/aws/2011/08/run-amazon-elastic-mapred...

amalag · on Oct 18, 2012

Yes I was wondering why the author implemented his own queuing mechanism instead of just using hadoop via EMR.

zerop · on Oct 17, 2012

I want to use common crawl to periodically fetch crawled data for some of the sites. How frequently does common crawl updates its data set. Does it crawl all sites?