@saaaam, as a tech leader working in the publishing industry, I'd be interested in connecting with you to hear more about what you work on. My email is my username @lp0.org. Maybe there are opportunities for us to collaborate!
As someone who has been down this road many times before, I can't stress this enough: DDoS mitigation services don't solve the problem of an app-specific layer-7 attack, so it's important to test how well your mitigation service actually responds (it isn't a silver bullet). You also need to make sure your team has tested, proven procedures for engaging the service, responding to attacks, etc. Services like NimbusDDoS (www.nimbusddos.com) are good because you can run realistic scenario tests and make sure your team and infrastructure are prepared. There are other services out there that I'm less familiar with, but either way it's a really worthwhile exercise.
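To make the "scenario testing" point a little more concrete, here's a minimal sketch (Python, stdlib only) of the kind of application-layer traffic generator a drill might use. The URL, worker count, and duration are hypothetical, and this should only ever point at a staging endpoint you own, during a scheduled exercise your mitigation provider knows about; the goal isn't to take anything down, it's to see how the mitigation layer and your runbooks actually respond.

    import concurrent.futures
    import time
    import urllib.request
    import urllib.error

    # Hypothetical staging endpoint -- only ever point this at
    # infrastructure you own, during a pre-announced drill.
    TARGET = "https://staging.example.com/search?q=expensive-query"
    WORKERS = 20           # concurrent clients
    DURATION_SECONDS = 60  # how long the drill runs

    def hit(url):
        """Issue one request and classify the outcome."""
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status
        except urllib.error.HTTPError as e:
            return e.code      # e.g. 403/429 once mitigation kicks in
        except Exception:
            return "error"     # timeouts, resets, etc.

    def drill():
        results = {}
        deadline = time.time() + DURATION_SECONDS
        with concurrent.futures.ThreadPoolExecutor(max_workers=WORKERS) as pool:
            while time.time() < deadline:
                futures = [pool.submit(hit, TARGET) for _ in range(WORKERS)]
                for f in concurrent.futures.as_completed(futures):
                    outcome = f.result()
                    results[outcome] = results.get(outcome, 0) + 1
        # The interesting output: did the mitigation service start
        # blocking/challenging, and how long did that take?
        print(results)

    if __name__ == "__main__":
        drill()

The numbers matter less than the process: did the on-call person know how to engage the mitigation service, and how long did it take before requests started getting blocked or challenged?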
We just use RT [1] and TWiki [2]. Changes come in as tickets in RT and we discuss them at change management meetings (which we document in the TWiki). We document everything in the TWiki. If someone comes and asks for our change management policy, we point them at the TWiki, which talks about Puppet, and then we can show them the change management minutes, etc. We have a light process and it seems to work.
A case management system and a wiki have typically been how I have done this. It can be a little tough, though, because these tools aren't necessarily built for this type of workflow. Perhaps RT does a better job than some of the other options, which really want to be support ticketing systems or bug trackers rather than change management systems.
Yeah, we only have 4 people in ops and about 20 in development. We have a daily standup where the ops guys and 1 person from dev meet (5 people total). We discuss what happened in the past 24 hours and what's coming up in the next 24 -- this takes about 10 minutes. The process is super light.
We also use Puppet with git. This lets us version everything that goes into production, since every change goes out as a Puppet tweak. It's great for rolling back changes or getting an idea of what was deployed. Like I said, read that Visible Ops Handbook.
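As a rough illustration of what that buys you (a sketch only -- it assumes your Puppet manifests live in a git repo at a hypothetical path like /etc/puppet and that deploys are just commits to that repo), "what was deployed" and "roll it back" become plain git operations, here wrapped in a little Python helper:

    import subprocess

    # Hypothetical location of the git-managed Puppet manifests.
    PUPPET_REPO = "/etc/puppet"

    def git(*args):
        """Run a git command inside the Puppet repo and return its output."""
        return subprocess.run(
            ["git", "-C", PUPPET_REPO, *args],
            check=True, capture_output=True, text=True,
        ).stdout

    def what_was_deployed(n=10):
        """Show the last n changes that went into production."""
        return git("log", f"-{n}", "--oneline")

    def roll_back(commit):
        """Revert a bad change; the next Puppet run picks up the revert."""
        return git("revert", "--no-edit", commit)

    if __name__ == "__main__":
        print(what_was_deployed())

In practice you'd probably just run those two git commands by hand; the point is that keeping the manifests in git makes the change history and the rollback path explicit.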
We have used NetApp for years and decided to move to Pillar Data Systems, as they are much more forward-thinking, easier to work with, and understand storage systems at a very deep level. NetApp wants you to buy new equipment every few years and forces this by increasing support costs very quickly.
The 2-disk failure itself did not cause the outage; what did was the process the filer head had to go through to get the data back onto new drives, plus further actions taken with SnapMirror and other things to try to recover faster.