Our watchdog process calls the EC2 APIs directly to identify how many instances are running, which ones are spot instances, etc. Boto, the AWS client library for Python, makes that pretty easy. The watchdog isn't very sophisticated; it just checks that the correct number of instances is running in each auto-scale group. Our application servers aren't very efficient in certain respects, so we don't trust metrics like usage/load to make auto-scaling decisions.
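To give a rough idea of what such a check can look like, here is a minimal sketch. It is not our actual watchdog: it assumes boto3 (the current successor to Boto), assumes instances carry the `aws:autoscaling:groupName` tag that EC2 Auto Scaling applies, and the function names and the `desired` mapping are made up for illustration.

```python
from collections import Counter

# Tag that EC2 Auto Scaling sets on the instances it launches.
GROUP_TAG = "aws:autoscaling:groupName"


def fetch_running_instances(region="us-east-1"):
    """Return running instances as plain dicts (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the pure functions below work without it

    client = boto3.client("ec2", region_name=region)
    paginator = client.get_paginator("describe_instances")
    pages = paginator.paginate(
        Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
    )
    return [
        instance
        for page in pages
        for reservation in page["Reservations"]
        for instance in reservation["Instances"]
    ]


def group_of(instance):
    """Return the auto-scale group name from the instance tags, or None."""
    for tag in instance.get("Tags", []):
        if tag["Key"] == GROUP_TAG:
            return tag["Value"]
    return None


def count_by_group(instances):
    """Count all instances and spot instances per auto-scale group."""
    total = Counter()
    spot = Counter()
    for instance in instances:
        group = group_of(instance)
        if group is None:
            continue  # not managed by any group; ignore
        total[group] += 1
        # On-demand instances have no InstanceLifecycle key at all.
        if instance.get("InstanceLifecycle") == "spot":
            spot[group] += 1
    return total, spot


def find_mismatches(desired, instances):
    """Map group name -> (running - desired) for every group that is off target."""
    total, _ = count_by_group(instances)
    return {
        group: total[group] - want
        for group, want in desired.items()
        if total[group] != want
    }
```

A watchdog loop would call something like `find_mismatches({"web": 12}, fetch_running_instances())` on a timer and alert or launch instances when the result is non-empty. Keeping the counting logic separate from the API call makes it easy to test without touching AWS.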
If I were doing it over again, I'd just use Amazon's auto-scaling features for all of this. At the time we built this, EC2's auto-scaling didn't support some of the features we needed. Since then, they've made it a lot easier to do things like set up a repeating schedule for auto-scaling, rather than using metrics.
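For illustration, a repeating schedule can be expressed today with scheduled actions on an Auto Scaling group. The sketch below assumes boto3 and its `put_scheduled_update_group_action` call; the group name, sizes, and cron times are hypothetical, and the desired capacity has to fall within the group's existing min/max limits.

```python
def weekday_schedule(group_name, peak_size, off_peak_size):
    """Build two recurring scheduled actions (cron expressions are UTC by default)."""
    return [
        {
            "AutoScalingGroupName": group_name,
            "ScheduledActionName": f"{group_name}-scale-up",
            "Recurrence": "0 13 * * 1-5",  # hypothetical: weekday mornings
            "DesiredCapacity": peak_size,
        },
        {
            "AutoScalingGroupName": group_name,
            "ScheduledActionName": f"{group_name}-scale-down",
            "Recurrence": "0 1 * * 2-6",  # hypothetical: after the evening peak
            "DesiredCapacity": off_peak_size,
        },
    ]


def apply_schedule(actions):
    """Register the actions with EC2 Auto Scaling (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so weekday_schedule can be used without it

    client = boto3.client("autoscaling")
    for action in actions:
        client.put_scheduled_update_group_action(**action)
```

Calling `apply_schedule(weekday_schedule("web", 12, 4))` would register both actions; re-running it with the same action names updates them in place.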
We only have one EC2 AMI that we use for all of our servers. That AMI is pretty basic; it only does enough to connect to our Puppet configuration management servers. Puppet then configures the boxes as web servers (or databases, or...) and adds them to the appropriate load balancer.
We revise the "right number of instances" every few weeks based on latency and traffic numbers. But sometimes when we release updates, we'll find that we suddenly need a lot more capacity (or a lot less if we improved performance). We have automated tools to help us notice performance regressions. Once we decide that we need to change the pool size, we adjust the watchdog configuration by hand.