The secret is in the tests. Every time you break the site you add a new test to make sure it never happens again. As long as you run the tests in parallel you can add a virtually unlimited number of tests.
We push to production constantly throughout the day, whenever a new feature, fix, or update is finished and the developer is confident that it works (and the unit tests pass). We have a system set up to do one-click (well, four-click) deployments to all of our servers, plus rolling graceful restarts and point-in-time rollbacks or specific-commit deploys.
We do double-digit deploys daily and as we get more developers that number increases. We did 30 deploys yesterday, and >300 in September, and the amount of time, trouble, and frustration it's saved us is immeasurable.
Sounds like a really great process, and I bet it was worth investing in. When we get to the scale of having enough dev power to do this while also innovating on the core, certainly! Thanks for sharing.
You don't need "sophisticated testing frameworks" to validate code quality. TDD should be baked into your culture, even if it's not as detailed as at some larger companies.
Pushing code to production should be an automated task that anyone can set up these days using something as simple as Fabric. Heck, code in reverse release scripts too.
And if you have few boxes in your farm/lb then push to half of them, monitor and if all good push to the rest. You should have 0 downtime in this model. Even with 2 instances under a load balancer.
All this requires some initial time expenditure, but over the long term it saves you a ton of time, headaches, lost sleep, etc., and allows you to push at any time of the day.
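A rough sketch of the "push to half, monitor, push to the rest" flow, under stated assumptions: `deploy`, `is_healthy`, and `rollback` are hypothetical callbacks you would supply yourself (with Fabric they would wrap its remote-command calls), and taking a box out of the load balancer before touching it is left out entirely.

```python
def rolling_deploy(hosts, deploy, is_healthy, rollback):
    """Deploy to the first half of hosts, check health, then do the rest.

    Returns True if every host was deployed and healthy. On a failed health
    check, every host touched so far is rolled back and False is returned.
    """
    split = (len(hosts) + 1) // 2  # first batch gets the extra host if odd
    batches = [hosts[:split], hosts[split:]]
    deployed = []
    for batch in batches:
        for host in batch:
            deploy(host)
            deployed.append(host)
        if not all(is_healthy(host) for host in batch):
            # The "reverse release script": undo in reverse order.
            for host in reversed(deployed):
                rollback(host)
            return False
    return True


# Usage with stand-in callbacks that only record what would happen.
log = []
ok = rolling_deploy(
    ["web1", "web2"],
    deploy=lambda h: log.append(("deploy", h)),
    is_healthy=lambda h: True,
    rollback=lambda h: log.append(("rollback", h)),
)
```

With two instances this deploys one, checks it, and only then touches the other, so a bad release never reaches both boxes at once.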
I definitely agree on the long-term benefits. At the moment, I think we are still changing too many things too fast to keep a non-core set of tests. We are actually using our friends from RainforestQA to do the bulk of our testing, which has been really helpful, but some things still fall through the cracks, like an AJAX script that breaks because of a new caching implementation and only starts to fail after enough events have taken place. That's when it's really helpful to know that a large number of your users are offline and are not all going to trip up at the same time.
Two big companies I've worked at have generally followed the same progression of deployment times:
Stage One: Right after work on Friday (until some big client that files all their payments on Friday afternoon complains.)
Stage Two: Late on Friday night (until problems arise and all but one of the mainframe admins have gone camping for the weekend.)
Stage Three: Late on Thursday night (until problems arise the next morning and half of the DBAs are on vacation and the other half are "working from home".)
Stage Four: Late on Wednesday night (until problems arise that take the system down for 18 hours and impact clients on Thursday morning.)
I worked at a company that had a policy of no changes to production on Fridays, unless our service was completely down and we were losing people's money.
Given that it was the sort of place where one developer didn't have access to deploy code and had to go walk a senior developer through making the changes by hand (and we didn't use a revision control system), that was probably an optimal circumstance.
So basically, they got to the point where everyone knew that Friday after work was _the_ time, and it was your job as a user to avoid that timeframe? Not ideal, but I see it working well, especially for banks.
The best time to push is the middle of the day of any weekday, possibly when there's a lot of traffic. You'll immediately figure out if there are problems and should already have the setup to revert or have enough tests to know that the basic functionality will be working.
If you push at any other time, you'll have no quick way of finding out when something is wrong. Friday or weekend pushes are a bad idea most of the time.