A 15-minute skim of this - specifically the sections on the stuff I've worked with the most - suggests this is a very, very good complement to the official documentation.
At a minimum, I'd recommend anybody/everybody considering AWS read and think about the "When to use AWS" section. Whilst AWS is an excellent set of tools that has completely changed the economics of deploying software, there are times when you should use Google Cloud, times you should use bare metal, and times you should use Heroku. AWS is a complex beast. Heroku is simple, but has limitations.
There are a bunch of apps I'm thinking about building at the moment where I realise a hybrid approach is best: some of GCP's stack, some of AWS', and a small amount of my own bare metal. Knowing when to choose which is not intuitive and comes with time, but there are big, big clues that will help the uninitiated in that section of this open guide.
Also, if you're looking to the future, AWS Lambda and Google Cloud Functions are perhaps the most exciting things to start building knowledge of now if you're a developer, I think.
> There are a bunch of apps I'm thinking about building at the moment where I realise a hybrid approach is best: some of GCP's stack, some of AWS', and a small amount of my own bare metal. Knowing when to choose which is not intuitive and comes with time, but there are big, big clues that will help the uninitiated in that section of this open guide.
Unless you have a metric shitton of money to blow, there's never a good reason to start with that.
The most expensive part of any of those cloud providers is networking. If you need to transfer data between bare metal <-> AWS, you'll need Direct Connect, which basically charges an arm and a leg.
Transferring between AWS <-> GCE is expensive for the same reason. Sure, if you're Apple-scale and need better data redundancy, maybe it's okay. Maybe. But that's not an app you think about building as an individual or small company.
I also don't think GCP's stack has anything whatsoever that AWS's doesn't, so it's odd to mention it in that phrase.
If you'd be so kind as to provide an example application you're thinking about, and the reason each of those is needed for some part of it, I'd be happy to hear it!
Personally, I'm not convinced the price will come down low enough for cloud functions / AWS Lambda to ever be cost-effective. We've looked at it plus API Gateway, and it would be orders of magnitude more expensive than our current giant fleet of webservers.
Kubernetes (and similar technologies), on the other hand, makes it possible to get the same economics as cloud functions while still tying your cost directly to the computing resources you use. It also gives you the freedom to (with some pain) move your entire platform to a different provider.
This was exactly my reaction. The tips around Amazon Redshift were spot-on, including a few obscure-but-critical ones, e.g. the one about many small tables taking up a ton of disk space!
I recommend you also make the content available in a one-topic-per-page format ASAP, before someone else does and takes credit for it.
WHY: Google still doesn't handle anchor-links very well. You have 1000 amazing articles on a single page. Each section (e.g.: "High Availability on AWS") would be a great resource for someone searching on that topic in Google. But when you put it all on one page Google infers "1/1000th of this page is about high availability on AWS" and gives better rankings to a page that is 100% about high availability on AWS.
I'm sure it would be pretty simple to write a script that breaks up topics into individual pages. I love the style of having it all on one page but I think it would be a waste of your hard work not to get all this great writing in front of search.
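To sketch what I mean - a rough, untested script that splits a single README.md on its "##" headings into one page per topic. The paths and heading level are assumptions about how the guide is laid out:

```typescript
// Split README.md into one markdown file per "## Topic" section.
import { readFileSync, writeFileSync, mkdirSync } from "node:fs";

const source = readFileSync("README.md", "utf8");
mkdirSync("pages", { recursive: true });

// Split just before every line that starts a "## " heading.
const sections = source.split(/\n(?=## )/);

for (const section of sections) {
  const heading = section.split("\n", 1)[0].replace(/^#+\s*/, "").trim();
  const slug = heading.toLowerCase().replace(/[^a-z0-9]+/g, "-").replace(/(^-|-$)/g, "");
  if (!slug) continue; // skip anything without a usable heading
  writeFileSync(`pages/${slug}.md`, section);
}

console.log(`Wrote ${sections.length} pages.`);
```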
I understand the concern. We'll try doing something about that. That said, single page on GitHub for the moment means (1) discoverability directly on github.com, which helps everyone and (2) browser search on the whole guide (which actually is more helpful than you might think!).
Completely agree, once I discover a guide like this, I bookmark it, come back to it, and really value the ctrl-f-ability.
I was recommending the one-topic-per-page idea for others who haven't yet found this nugget. I think a lot more people will discover it and benefit from it if they are finding it from specific google searches.
I know HN can be a source of a lot of unfounded flyby critiques, and I don't want to contribute to that trend. I see you have a pretty good contributing guide, so maybe I'll try and submit a PR with a solution in the spirit of Hacktoberfest!
As I'm sure you're aware, a lot of documentation is made available in several formats, such as 1) single page HTML, 2) multiple page HTML (e.g. one page per section), and 3) single PDF.
The different versions are automatically generated from a single common source but that would probably require a major change in how you create your guide and so may be more work than you want to take on.
To illustrate why this is useful, I'm a network engineer who primarily works with Cisco gear. Cisco has an absolute wealth of information -- product manuals, configuration guides, etc. -- accumulated over a couple of decades spread out across their web site(s). Unfortunately, their web site team likes to change things -- A LOT! -- and pages "move" frequently and it's often impossible to find them again. Because pretty much everything I'm interested in is available in PDF format, I save these versions locally where I can find them and refer to them later. Quite often, the times that I really need to look up some obscure feature are times when I am somewhere that either 1) I cannot connect my laptop to the network or 2) Internet access is unavailable, heavily filtered, or outright prohibited (of course, that's probably not going to apply for someone working with AWS.)
Regardless, you've put together a wonderful, comprehensive resource here. I'm a "minimal" user of AWS (primarily S3) but I am familiar with the different products, and you've done an awesome job of summarizing Amazon's "dense" documentation down to its key points.
This is great. I've been working on AWS for close to 10 years now and an open guide is something I both need and want to contribute to.
Many of us have simple goals on AWS. The official AWS docs are thorough, but too technical. There are blog posts about everything, but they can be hard to find or go out of date.
I hope this open guide helps us all get our jobs done faster and easier!
Very glad to hear. It's this sentiment exactly that led us to get this started. We all have 100s of valuable tricks and gotchas we've learned over the years, but 99% of the time we fail to write them down and share them helpfully. Do join us on Slack/GitHub and help us get your tips included, too.
I have the same issue with not writing stuff down, and there are plenty of gotchas. I finally got round to starting a 3-part blog series on AWS vs Azure, vendor lock-in, and pricing confusion [^0], but I'll see if I can contribute to this too.
I am relatively new to building larger apps. I've worked for a couple of years building with Drupal and hacking PHP. Now I only want to develop full-stack JavaScript; I really enjoy its messy nature. Last week I discovered that user-uploaded files are not persistent on a Heroku-hosted app. To solve that problem, I created an AWS account for S3, which was the first time I'd used AWS. I quickly figured out how to swap the Node.js fs functions for the AWS SDK. Setting up a bucket and a test bucket was easy, and configuring IAM rules is intuitive.
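For anyone making the same swap, this is roughly what it looks like - a minimal, untested sketch using the AWS SDK for JavaScript v3; the bucket name, key prefix, and region are placeholders:

```typescript
// Instead of fs.writeFile to local disk (which Heroku wipes), push the
// upload straight to S3. Bucket, key prefix and region are placeholders.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-east-1" });

async function saveUpload(fileName: string, body: Buffer): Promise<void> {
  await s3.send(
    new PutObjectCommand({
      Bucket: "my-app-uploads",   // placeholder bucket
      Key: `uploads/${fileName}`, // placeholder key prefix
      Body: body,
    })
  );
}

// Usage: saveUpload("avatar.png", fileBuffer);
```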
You're right. Their docs are far beyond the scope of what I needed to get started. Interestingly, I would rather have Google searches about AWS show Stack Exchange answers, but most of the first results are Amazon documentation, which is far more difficult to read and sort through.
Wow, the link to http://www.ec2instances.info/ alone is so helpful. I wish I'd had this set of resources a year ago when I spent weeks trying to understand AWS' own documentation.
If you want to answer the question "What's the cheapest way to get 16gb of ram and 4 cores?" (or the same for a 1 year term) then having a list you can filter and sort is much more helpful than Amazon's pricing pages.
Upvoted for this link alone. I am so, so tired of the scroll, squint, hunt & jump I have to do on the current Amazon EC2 pricing page to compare costs and features of instances. Especially when trying to compare legacy instances (which we still have a lot of) to newer or VPC ones.
Remember, this isn't a blog, it's a living GitHub project: if you see value in info like this, consider contributing or giving feedback to improve it. :)
What I would consider one of the most important pieces of this guide is closer to the bottom (https://github.com/open-guides/og-aws#aws-data-transfer-cost...) where it covers cost management strategies. The Data Transfer Costs diagram makes the buried details of AWS networking costs stand out in a digestible way. I've read the AWS docs on this many times and still missed out on some of the nuggets exposed in the diagram.
As a consultant that often recommends migration to AWS services for clients, this is a treasure-trove of information when looking at each individual use case and making a determination about how best to advise. It's often difficult to know with certainty whether AWS vs Google Cloud vs bare-metal is the best course of action, and the advice and information here goes a long way in helping make those decisions easier.
One of the biggest lessons I've learned is that you need occasional EBS-to-EBS backups. Anyone who has had to recover from snapshots knows the painful reason why...
I get a lot of shit for not giving straight answers... just spin up an instance, put a gig of data on an EBS drive, snapshot it, create an EBS volume from the snapshot as if you were recovering (rough sketch below), and try pulling 100+ megs of data off it... you'll never not keep EBS copies again. Big clue: pre-warming.
It will take you an hour to do, and you'll be years wiser.
This is probably the number one reason people experience extra downtime when rebuilding after whatever issue... and EBS volumes in certain regions can and will die silently.
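To make the experiment above concrete, here's a rough, untested sketch of the snapshot -> restore half using the AWS SDK for JavaScript v3 (the volume ID, availability zone, and region are placeholders); the slow part is step 3, actually reading the data back:

```typescript
// Minimal sketch of the snapshot -> restore experiment, using @aws-sdk/client-ec2.
import {
  EC2Client,
  CreateSnapshotCommand,
  CreateVolumeCommand,
} from "@aws-sdk/client-ec2";

const ec2 = new EC2Client({ region: "us-east-1" });

async function snapshotAndRestore(): Promise<void> {
  // 1. Snapshot the source volume (the one holding your ~1 GB of test data).
  const snap = await ec2.send(
    new CreateSnapshotCommand({
      VolumeId: "vol-0123456789abcdef0", // placeholder
      Description: "restore-speed test",
    })
  );

  // 2. Once the snapshot completes, create a new volume from it,
  //    as if you were recovering after a failure.
  const restored = await ec2.send(
    new CreateVolumeCommand({
      SnapshotId: snap.SnapshotId!,
      AvailabilityZone: "us-east-1a", // placeholder
      VolumeType: "gp2",
    })
  );

  // 3. Attach the new volume to an instance and time how long it takes to
  //    read 100+ MB off it. Blocks come back lazily from S3 on first read,
  //    which is why a freshly restored volume is painfully slow until every
  //    block has been touched ("pre-warming").
  console.log("Restored volume:", restored.VolumeId);
}

snapshotAndRestore().catch(console.error);
```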
The "use IAM roles for EC2" recommendation is a bit sketchy. The current security zeitgeist, not just after Colin's post but also after DerbyCon and Black Hat, is that EC2 roles are dangerous and, when under attack, not very predictable.
Using IAM roles for EC2 is far and away better than what beginners would otherwise do, which is create a set of permanent credentials and deploy them everywhere.
"Have the application retrieve a set of temporary credentials and use them." "In the case of Amazon EC2, IAM dynamically provides temporary credentials to the EC2 instance, and these credentials are automatically rotated for you."
An attacker should only have access until the creds expire, no?
That's right. Instance profile credentials have an expiration time of a few hours. However, if the instance policy is very open, you could create yourself a new IAM user or use STS to maintain persistence after the generated credentials expire.
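For context on what those temporary credentials look like in practice, here's a minimal sketch of fetching them from the instance metadata service; the role name is a placeholder, and it assumes Node 18+ (global fetch) and the older IMDSv1 behaviour (IMDSv2 needs a session token first):

```typescript
// Fetch the temporary credentials an instance role exposes via the
// instance metadata service. "my-app-role" is a placeholder role name.
const BASE = "http://169.254.169.254/latest/meta-data/iam/security-credentials";

async function showRoleCredentials(): Promise<void> {
  const creds = await (await fetch(`${BASE}/my-app-role`)).json();
  // The payload includes AccessKeyId, SecretAccessKey, Token and, crucially,
  // an Expiration timestamp - AWS rotates these automatically, so anything
  // copied off the box stops working once it expires.
  console.log(creds.AccessKeyId, creds.Expiration);
}

showRoleCredentials().catch(console.error);
```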
This is why it's important to lock down instance profiles to do only what the application needs and no more. For example, you may grant permission for s3:DeleteObject, and in the event that the box is compromised the attacker would be able to delete files in your S3 bucket. However, if you don't grant access to s3:DeleteObjectVersion, you can evict the attacker and restore the deleted objects with relative ease.
This is why I would not recommend giving s3:* to an instance profile (or indeed, to any production credentials).
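As a rough illustration of that kind of least-privilege instance profile - a minimal, untested sketch with the AWS SDK for JavaScript v3; the role, policy, and bucket names are placeholders:

```typescript
// Attach a narrowly-scoped inline policy to an instance role: the app can
// read, write and delete objects, but not delete object *versions*, so a
// versioned bucket stays recoverable if the box is compromised.
import { IAMClient, PutRolePolicyCommand } from "@aws-sdk/client-iam";

const iam = new IAMClient({ region: "us-east-1" });

const policy = {
  Version: "2012-10-17",
  Statement: [
    {
      Effect: "Allow",
      Action: ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      Resource: "arn:aws:s3:::my-app-bucket/*",
    },
  ],
};

async function attachPolicy(): Promise<void> {
  await iam.send(
    new PutRolePolicyCommand({
      RoleName: "my-app-instance-role",
      PolicyName: "my-app-s3-access",
      PolicyDocument: JSON.stringify(policy),
    })
  );
}

attachPolicy().catch(console.error);
```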
Thank you for the reply - that makes sense to me; least privilege seems to be the primary defense in that case. Having explicit creds you rotate yourself I could see having benefits as far as control, but it also requires more work and more potential for implementation mistakes.
Well, the AWS credentials auto-rotate. It does, however, provide a familiar place for an attacker to go to get the instance credentials, but that doesn't really help them much. At some point, those credentials must exist in plain text for you to use them. If they're in a config file, they can be read out; if they're in RAM, they can be pulled out with a debugger. At least if your box is temporarily owned due to a zero-day that you later patch, the credentials aren't going to be valid for long - although that situation would be hardly ideal!
You've also got to go to the trouble of getting the credentials onto your box in the first place. With instance roles, you can launch an instance and have it immediately capable of doing what your application needs. In the case of most applications my company runs, the instance profile is enough and no further security credentials are required. When database credentials are required, they're retrieved via S3, authenticated by the instance profile.
We use IAM roles and credstash (DynamoDB and KMS) for retrieving database credentials. My comment was mostly about the fact that we cannot control the rotation for roles - say, in the event of a breach where someone committed keys to GitHub, I can explicitly expire/rotate them (assuming those keys were not themselves temporary and have not already expired :)).
I believe you can, actually [0]. In a production setting it's a lot harder to accidentally leak the credentials - my concern would be if someone compromised the instance, or if it was tricked into opening the instance metadata up to the net, such as via a badly configured nginx instance (how you'd do that accidentally, though, I have no idea).
I would really, really appreciate if you would elaborate on this. Security seems to have the most unspoken community knowledge of anything I need to know.
Yep, not sure where this perception of "nobody's using it" comes from, but I have been using it at 2 different companies in the last 3 years as well, with nothing but love. In fact, if it were the case that "nobody's using it for good reasons", maybe we ought to know the reasons?
Been using OpsWorks for about a year now, and while it has very significantly streamlined our provisioning/deployment tasks, "nothing but love" is not quite how I'd describe it.
You could code up something in the deploy hook to select a master node (usually the first instance in the layer) to run migrations, and you could disable "Run Migrations" when you deploy. I do this for the Rails app at my company.
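One possible (untested) way to express that "first instance in the layer" check from the box itself, assuming it can reach the OpsWorks API and the instance metadata service - the layer ID and region are placeholders:

```typescript
// Only the oldest online instance in the layer runs migrations.
import { OpsWorksClient, DescribeInstancesCommand } from "@aws-sdk/client-opsworks";

const opsworks = new OpsWorksClient({ region: "us-east-1" }); // placeholder region

async function shouldRunMigrations(): Promise<boolean> {
  // Which EC2 instance am I?
  const myId = await (
    await fetch("http://169.254.169.254/latest/meta-data/instance-id")
  ).text();

  const { Instances = [] } = await opsworks.send(
    new DescribeInstancesCommand({ LayerId: "layer-id-placeholder" })
  );

  // Sort online instances by creation time and elect the oldest one.
  const online = Instances.filter((i) => i.Status === "online").sort((a, b) =>
    (a.CreatedAt ?? "").localeCompare(b.CreatedAt ?? "")
  );

  return online[0]?.Ec2InstanceId === myId;
}

shouldRunMigrations().then((run) => console.log("run migrations:", run));
```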
I actually solved this with our custom deploy script. We choose the machine that runs the migrations.
The necessity of building and maintaining a custom deploy script is the biggest wart for us (though I admit that the API is pretty good, and said script has not had that much maintenance overhead).
Those of us contributing to the Guide so far have generally been at companies where it's not used. I'd love to see a contribution (a few bullet points and/or links) that better covers the basics and reflects how/when it's useful.
Please write an update and submit a PR. I'm moving from Ansible to Chef and would love some real world advice on what Opsworks has to offer me without another dreaded POC.
It's likely the original authors aren't using Chef or just use Chef server as I do now.
My guess is that there are companies with "legacy" applications, that can't really be re-written into a distributed system, have a large footprint, but still need to be run.
A special sub-category of those is huge RDBMS instances - a pretty common choke point in growing companies with weaker engineering teams. Some of those companies would pay basically any price to keep those DBs running.
I've temporarily scaled up to c4.8xlarge for a few hours every now and then to get some parallelized computations done quickly. Plays nicely with Clojure's (pmap) function.
Applied ML research here also -- a lot of interactive (but highly parallelizable) modeling and graphing. With medium-size data sets of around 3-4GB in RAM, by the time you've forked the process a few times you easily end up beyond the m4.10xlarge or c4.8xlarge limits.
IMO there's an awkward space between small data and big data where it isn't really worth spending a long time treating it like a real "big data" problem, and the x1 instance gives you an easy out.
> A single EBS volume allows 10k IOPS max. To get the maximum performance out of an EBS volume, it has to be of a maximum size and attached to an EBS-optimized EC2 instance.
Out of date; EBS volumes can be up to 20k IOPS per volume, and what is "maximum size"? Getting the maximum performance out of a volume depends on the workload, the instance size you've attached it to (rather than EBS optimization), the number of IOPS provisioned, and whether you've pre-warmed it from a snapshot restore or not.
> A standard block size for an EBS volume is 16kb.
A block can be 1kb -> 256kb in size. It depends on the application.
> EBS volumes have a volume type indicating the physical storage type. The types called “standard” (st1 or sc1) are actually old spinning-platter disks, which deliver only hundreds of IOPS — not what you want unless you’re really trying to cut costs. Modern SSD-based gp2 or io1 are typically the options you want.
The ST1/SC1 wording is misleading. With ST1 you only need "100s" of IOPS when you're dealing with big blocks, and SC1 isn't performance-oriented at all.
Any IOP on EBS is measured at 16kb granularity. Not the same as block size, but helpful to know because it lets you set your read-ahead and other values to no lower than 16kb. At least this was the case for many years. Trying to find the official docs now.
Great work! I started using AWS back when it was just for simple websites, and the plethora of services now (50!), and the pricing (especially the pricing!), is overwhelming to track.
So overwhelming, in fact, that I decided it was easier to get some VPSs and use common, work-anywhere tools (e.g. SaltStack) to manage them than to skill up on AWS-specific stuff.
Thanks a lot for posting this. I went to a Linux conference over the weekend and was talking with some friends about their datacenter jobs. I felt hopelessly lost trying to understand all the intricacies at the routing, storage, and backup levels, whereas this guide gives a good bird's-eye view of the stacks.
I would add as a VPC gotcha the use of the EIP_Disable_SrcDestCheck flag [1] to enable layer 2 capabilities. This is a feature that is only present in AWS; neither Google Compute Engine nor Microsoft Azure has it. So, if you craft an Ethernet frame modifying the destination MAC address but not the destination IP within your local subnet, the packet will be delivered based on the IP rather than the MAC address, unlike what you'd expect on a real Ethernet network.
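For reference, the underlying EC2 attribute that flag maps to (as far as I know) is SourceDestCheck. A minimal sketch of flipping it with the AWS SDK for JavaScript v3, with a placeholder instance ID and region:

```typescript
// Turn off source/dest checking on an instance so it can forward traffic
// that isn't addressed to its own IP (NAT boxes, routers, etc.).
import { EC2Client, ModifyInstanceAttributeCommand } from "@aws-sdk/client-ec2";

const ec2 = new EC2Client({ region: "us-east-1" });

async function disableSourceDestCheck(instanceId: string): Promise<void> {
  await ec2.send(
    new ModifyInstanceAttributeCommand({
      InstanceId: instanceId,
      SourceDestCheck: { Value: false },
    })
  );
}

disableSourceDestCheck("i-0123456789abcdef0").catch(console.error); // placeholder ID
```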
I have recently started out on AWS (I initially used AWS like I used to use DigitalOcean; however, after trying out Serverless, I'm of a different mind and am changing my ways to do it the AWS way), so this is pretty awesome!
I had tried a lot of databases (Postgres, Mongo, Couch, and very recently RethinkDB) before trying out DynamoDB. So I just jumped in, started something basic, and read tutorials as I went along.
There's still a lot of stuff I don't fully understand (for example, the read/write capacity settings - I left them at the default of 5), but I guess I'll learn as I go along.
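For anyone else wondering where that "5" lives: it's the provisioned read/write capacity units, set per table. A minimal sketch with placeholder table and attribute names:

```typescript
// Create a DynamoDB table with explicit provisioned throughput.
import { DynamoDBClient, CreateTableCommand } from "@aws-sdk/client-dynamodb";

const ddb = new DynamoDBClient({ region: "us-east-1" });

async function createTable(): Promise<void> {
  await ddb.send(
    new CreateTableCommand({
      TableName: "my-test-table", // placeholder
      AttributeDefinitions: [{ AttributeName: "id", AttributeType: "S" }],
      KeySchema: [{ AttributeName: "id", KeyType: "HASH" }],
      // Read/write capacity units: the console default of 5 each is fine for
      // experiments; you pay for what you provision, not what you use.
      ProvisionedThroughput: {
        ReadCapacityUnits: 5,
        WriteCapacityUnits: 5,
      },
    })
  );
}

createTable().catch(console.error);
```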
Great guide. I've been using AWS since there were only a handful of services and it's become increasingly hard to keep up with all the additional ones that have been added in the last few years.
EFS had completely passed me by. Does anyone have experience with it? I'm wondering what it would be like to use for Whisper/Graphite (just on a single machine). I'm less interested in concurrent access and more interested in not having to resize drives as data grows / overprovision drives all the time.
The latency is higher than I had hoped. I wrote 10,000 files with 10 kb in each. It took 23 ms per file on average. Then I read them back. That took 8 ms per file on average.
That's way too much for the use case I was contemplating, so I didn't investigate further.
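Roughly the kind of test I ran, for anyone who wants to reproduce it on their own mount - a minimal sketch in Node; the mount path is a placeholder:

```typescript
// Small-file latency test: write and read back 10,000 files of 10 KB each.
import { mkdir, writeFile, readFile } from "node:fs/promises";
import { join } from "node:path";

const MOUNT = "/mnt/efs/latency-test"; // placeholder EFS mount point
const FILES = 10_000;
const payload = Buffer.alloc(10 * 1024, "x"); // 10 KB per file

async function bench(): Promise<void> {
  await mkdir(MOUNT, { recursive: true });

  let start = Date.now();
  for (let i = 0; i < FILES; i++) {
    await writeFile(join(MOUNT, `f-${i}`), payload);
  }
  console.log(`write: ${((Date.now() - start) / FILES).toFixed(1)} ms/file`);

  start = Date.now();
  for (let i = 0; i < FILES; i++) {
    await readFile(join(MOUNT, `f-${i}`));
  }
  console.log(`read: ${((Date.now() - start) / FILES).toFixed(1)} ms/file`);
}

bench().catch(console.error);
```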
It definitely _felt_ a bit slow rsyncing to it last night. In the Whisper use case, there are a ton of small appends to do every minute, so that could be an issue. I'm going to set up a machine with a Linux 4.x kernel today to try it on (as that's what they recommend, along with async mode).
As I recall, it's significantly more expensive than EBS, which kept me away. I've had a few use cases come up where shared access would be nice, but I was always able to use objects in S3 instead, which is far cheaper.
It's about 3x the price of EBS by the looks of it, but then again, I probably run an EBS drive 10x the size required so I don't have to deal with scaling it often...!
A good addition, but I wish there were a place for horror stories about this tech. For instance, we can't launch more than 4 or 5 containers a second on our ECS clusters.
This is so needed. I find Amazon's official documentation to be way too full of buzzwords and marketing speak. I just want someone to tell me what the thing does!
I think a better approach would be to use annotations on the current AWS docs, so that additional information is inline with the official documentation and you have both in the same place. The Hypothesis project is working on a browser plugin that does this, for example, and is already having success with academic research. https://hypothes.is/
You probably are aware, but AWS has a container orchestration service built into the platform with ECS. The container agent is open source (https://github.com/aws/amazon-ecs-agent).
In my experience, ECS is easy to run, as it's a first class part of the platform. Boot up the right "cattle" AMIs with the right ASG configuration and you're good to go.
K8s, Docker Swarm, Mesos, and Nomad have plenty of documented successes, but you have to stand up and operate the orchestration layer yourself. That means booting up "pet" AMIs and making sure they are monitored, etc. Then you boot up your "cattle" AMIs to run your apps.
The Convox philosophy is that you get application portability by packaging your app correctly with Docker. The orchestration layer should be invisible, something that you shouldn't build or operate yourself.
We run Rancher [1], which is open source, across multiple AWS regions, using a single ELB endpoint for container orchestration into different environments. You can use the stock AWS AMIs for the instances, and Rancher also provides RancherOS AMIs that work extremely well.
Rancher also has k8s as an option and makes deploying it much easier.
Although I'm familiar (at a high level only) with numerous topics/services related to AWS, I'm still doing things the legacy way on providers like DigitalOcean (which I'm 100% happy with), and I'm by no means an AWS guru... so this guide looks awesome for someone like me!
Really like the single page format. Much easier to search compared to scattered documentation on AWS's own site.
Definitely like the 1:1 mapping to Google/Azure.
Well, you could start by submitting a PR with everything you already know about Beanstalk:
1. That would be very valuable for everyone else.
2. A section that does not look overwhelmingly empty would attract more and higher-quality contributions from others. Kind of a reverse broken windows theory (https://en.wikipedia.org/wiki/Broken_windows_theory).
I've had the impression that Elastic Beanstalk (which I use) has suffered the fate of a few other AWS offerings in that it's seen as less trendy than Docker/ECS. (See also: CloudSearch vs Elasticsearch.) But EB can do some things very well and very painlessly.
EB tends to work very well when your requirements fit within its framework - and very badly when you try to do anything differently. We've moved to CodeDeploy because EB was slow to deploy, often left applications in an "unknown" state after deployment, ties application configuration to deployment, and generally felt fairly restrictive.
I've been compiling a lot of tips and tricks personally that I use to help train coworkers. I'm definitely going to cross reference and see if I can open a few useful PR's.
This is fantastic. I was thinking about it in the afternoon and I see it now! Very useful for guys like me who are just booting up in the back end and devops side!