It's not bad, but it needs editing. The Joel Test is a hugely important document (for a blog post) because it's incisive. Spolsky could have added, "does your team ban sprintf, strcpy, and strcat" and then written a graf on buffer overflows. He didn't, because that's not one of the very few questions in the Joel Test.
The Joel Test isn't "if Joel Spolsky was designing a new dev team from scratch, here's his whole checklist". But this sysadmin test seems that way.
So in that spirit, here's my first wave of things I think you should cut:
(3) Plenty of very excellent ops teams don't keep internal team metrics (other than availability stats). It's also too fuzzy.
(8) Prioritizing features over stability actually contradicts Joel on Software ("some bugs aren't worth fixing", to paraphrase). Saying that you prioritize one over the other is also a platitude. Axe this.
(9) Virtually every great dev team uses source control, has bug tracking, &c. Not every great ops team has "design docs" for every (or even any) project they undertake.
(11) Similarly, an "opsdoc" for every service (the mini website with "how to rebuild this") is a nice-to-have, not a must-have. How I know: I've never once met an ops team that actually has this.
(14) Dev/QA/Prod environments: I don't think you can axe this, but I think you got too aspirational, on two axes: first, most teams don't have dev AND QA AND prod (though every good ops team has at least a prod and a "something else" environment), and second, there are services that don't need this much rigor.
(22) Refresh policy for hardware? If it ain't broke, &c. Why does a good sysadmin team refresh hardware just for the hell of it? I remember when network admins used to be proud of keeping highly utilized networks running on the old ugly Cisco AGS+ boxes and made fun of the kids who bragged about their 7500s. This is too fuzzy to be part of a "test".
(23) Why do I care if servers stay up when one hard drive dies, as long as my service stays up even when a whole rack catches fire?
(28) Anti-malware? Really? In 2011? I'm sure you have a whole blog post to write about this, but if you have to justify it, maybe leave it out of your "test".
A suggestion for your document that also adds another acid test to things you should keep on the list: what's an ops team that everyone knows kicks ass as a result of doing all these things? When Joel Spolsky wrote The Joel Test, he got to use Microsoft as a "12 out of 12" case. Who does all these things? Amazon? (Did I miss that in your document? I'm tipsy, sorry.) Hope that's constructive.
I kind of liked the hardware refresh policy. It's not necessarily because the hardware goes stale. It's to keep the process and the personnel from going too stale.
By constantly mixing in new hardware one piece at a time, you compel code to be runnable on multiple generations of hardware at once, avoiding flag days. You continuously shake the bugs out of new code, preventing it from growing a hardware dependency in year N that only gets discovered in year N+2. You periodically drill the team (especially the newer folks) in the procedure for bringing up new boxes, while also accomplishing real work (gradual upgrade of the server farm) in the process. You end up running hardware with a continuous range of model numbers and batches, perhaps mitigating flaws that strike entire batches at once. And when disaster strikes and you have to replace a box ASAP, odds are better that you've set up a similar box in recent memory and know exactly what to get, how to set it up, and what the pitfalls might be - and if there are pitfalls, you discovered them during working hours on spare hardware, rolled back, and spent a few weeks fixing them instead of discovering them at 5 AM on a Saturday and having to fix them on the fly.
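To make the one-piece-at-a-time idea concrete, here is a rough Python sketch of a rolling refresh plan; the fleet size and the four-year lifetime are numbers I made up for illustration, not anything from the checklist.

    # Rough sketch of a rolling refresh: replace roughly 1/LIFETIME_YEARS of the
    # fleet each year instead of doing one big flag-day swap.
    # Fleet size and lifetime are hypothetical numbers.

    FLEET_SIZE = 200        # hypothetical number of servers
    LIFETIME_YEARS = 4      # hypothetical refresh cycle

    def yearly_refresh_plan(fleet_size, lifetime_years):
        """How many boxes to replace in each year of the cycle."""
        base, remainder = divmod(fleet_size, lifetime_years)
        # Spread any remainder over the first few years so no single year spikes.
        return [base + (1 if year < remainder else 0) for year in range(lifetime_years)]

    for year, count in enumerate(yearly_refresh_plan(FLEET_SIZE, LIFETIME_YEARS), start=1):
        print(f"Year {year}: replace {count} servers")
    # With 200 servers on a 4-year cycle that's 50 per year, so the team brings up
    # new hardware every year instead of rediscovering the procedure at flag day.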
(Of course, my devops team does everything in AWS, so what do I know about managing hardware?)
I agree that this checklist is way too long to vie with the Joel Test, though.
What mechanical_fish describes is totally normal, and the result of normal growth and hardware refreshes. A company doesn't start with 0 servers and buy 200. They start with 5, then add 15, then 30, then 50. After three years they get to 200. Then they start replacing them in waves. New generations of hardware become more efficient, so you can actually save money by replacing old hardware.
The "platform spread" is inevitable. Besides that it is impossible to physically replace all your servers at once, the bean counters also prefer it to be spread out. They don't want to replace $1,000,000 worth of servers every four years, they would rather replace $250,000 every year.
Sysadmins and devs shouldn't really care if they have five generations of hardware. There are a few things that matter so that new hardware fits into the infrastructure: having IPMI, serial ports for console redirection, enough RAM to run the apps. Try to keep as much as possible the same; it is way better to keep 3 models of spare power supply in stock than 20. Beyond that I don't think it matters. If every server were a different model it would suck, but having 5 generations is normal and good.
That's interesting. Perhaps the fact that I've learned about ops entirely in the era of virtual machines running atop disposable, generic, and (in the case of AWS) entirely invisible hardware has distorted my thinking on this matter...
"Sysadmins and devs shouldn't really care if they have five generations of hardware."
I wish it were true. Sadly, there are some services where scale and latency are so carefully measured that individual software releases are rejected if performance gets worse (or unacceptably worse, etc.). In these situations you need to test on all hardware platforms. It is much better to have fewer platforms; optimally just two: the one you are migrating off of and the one you are moving to.
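To illustrate the kind of gate I mean, here is a rough sketch; the platform names, latency numbers, and the 5% threshold are all hypothetical, not from any real release process.

    # Sketch of a per-platform performance gate: reject a release if p99 latency
    # regresses beyond a threshold on ANY hardware platform still in the fleet.
    # Platform names, numbers, and the 5% threshold are made up.

    BASELINE_P99_MS = {"gen4": 12.0, "gen5": 9.5}      # current release
    CANDIDATE_P99_MS = {"gen4": 12.3, "gen5": 11.2}    # proposed release
    MAX_REGRESSION = 0.05                              # 5% worse is the limit

    def find_regressions(baseline, candidate, max_regression):
        failures = []
        for platform, old_p99 in baseline.items():
            new_p99 = candidate[platform]
            if new_p99 > old_p99 * (1 + max_regression):
                failures.append((platform, old_p99, new_p99))
        return failures

    regressions = find_regressions(BASELINE_P99_MS, CANDIDATE_P99_MS, MAX_REGRESSION)
    if regressions:
        for platform, old, new in regressions:
            print(f"REJECT: p99 on {platform} went from {old}ms to {new}ms")
    else:
        print("Release passes the latency gate on every platform.")
    # Every extra hardware generation adds a row to these tables and one more
    # way for a release to fail the gate.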
For desktops... have you ever tried to maintain a Windows or Linux desktop environment with more than 4 "standard desktop configurations"? It becomes a nightmare. If you maintain a single "gold image" that you blast to all machines, every extra configuration makes that task harder; if you stay with the vendor's OS and try to maintain it "forever", it is even worse.
One thing that makes virtualization a "win" is that the virtual box looks like a single hardware platform. It reduces testing, etc. However, then you still need to test the virtualization software on all hardware platforms... so you've made things easier for everyone but that team.
A more detailed reply:
3) Really? The core of devops is being data-driven in your decisions. How can you decide whether you are maintaining the right uptime if you don't measure it? (There's a quick back-of-the-envelope availability calculation below.)
8) I agree with you, as does the text. I think you may have reversed what I wrote.
9) The good ones write so that they "think before they do". On a larger team it is important to communicate what you are about to do, or what you have done. I prefer to write mini design docs. The team I'm on does this and I like it so much I want to spread the word.
11) Again, the team I'm on does this and it works so well I want to spread the word. I see I need to expand this out to explain why, not just how.
14) I'll clarify that the point is not to make big changes directly on your production system. Whether it is QA+live or a zillion steps including dev, QA, UAT, pre-prod, canary, and prod doesn't matter, as long as it isn't zero steps.
22) Refresh policy: this is for PCs (non-servers). I'll clarify.
23) The last part makes your point. I'll rewrite to make it more evident.
28) "Anti-malware? Really? In 2011? I'm sure you have a whole blog post." Yes, I do: http://everythingsysadmin.com/2011/04/apt.html
Thanks for the reminder to add a link! (And if you are blown away that I had to list this, you can imagine my surprise at finding sites that violate this one!)
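As a quick illustration of the point in 3) above, the measurement itself is cheap; here is a back-of-the-envelope availability calculation with hypothetical outage durations.

    # Back-of-the-envelope availability from outage minutes.
    # The outage list is hypothetical; the arithmetic is the point.

    MINUTES_PER_MONTH = 30 * 24 * 60      # 43,200

    outage_minutes = [12, 3, 45]          # hypothetical outages this month
    downtime = sum(outage_minutes)        # 60 minutes
    availability = 1 - downtime / MINUTES_PER_MONTH

    print(f"Availability this month: {availability:.4%}")   # about 99.86%
    # Without a number like this, "are we maintaining the right uptime?" is a
    # matter of opinion rather than data.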
You have never seen a team that does all these things? You haven't worked at Google.
PS: The comments so far are excellent. I'll have a new draft soon. By the way... the hardware refresh is really about desktop/laptop management in my mind.
Oh... And I'll be teaching a half-day tutorial based on the list at Usenix LISA on Dec 5th!
I have not! If this is a document that says "you'd better be doing 95% of these things, because Google does 100%", you should definitely say that more clearly!
This is a good list. I think a key part is "The score doesn't matter as much as attitude." Not every company will have everything on the list; you can't expect small startups to have all this. But if they balk at the ideas, it shows that there is a problem.
It turns out that some companies have a management team that is against automation, written policies, and fixing security and stability issues. Here are questions I wish I had asked in job interviews:
4. Do you have a "policy and procedure" wiki?
8. In your bugs/tickets, does stability have a higher priority than new features?
16. Do you use configuration management tools like cfengine/puppet/chef?
20. Is OS installation automated?
28. Do desktops/laptops/servers run self-updating, silent, anti-malware software?
> 18. Do automated processes that generate email only do so when they have something to say?
> for question 18: how do you know when your notification system goes down?
You monitor your monitoring system. I think 18 is important; noise from automated processes will hide real problems. I worked with a manager who would consistently write cron jobs that ran as root (17) and sent out useless emails every day (18). One of the cron jobs sent 500KB - 10MB of text every day; no one will read 10MB of text, so if there is an error no one will see it. Write your scripts correctly, use --quiet flags, and redirect stdout to /dev/null.
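A minimal sketch of a wrapper that does this; the script itself is hypothetical (tools such as chronic from moreutils exist for exactly this job), but it shows the idea: run the command, stay silent on success, and only produce output, and therefore cron mail, on failure.

    #!/usr/bin/env python3
    # Quiet-cron wrapper sketch: run a command, say nothing on success, and only
    # print (which makes cron send mail) when something actually went wrong.

    import subprocess
    import sys

    def main():
        cmd = sys.argv[1:]
        if not cmd:
            print("usage: quietcron.py command [args...]", file=sys.stderr)
            return 2

        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return 0    # success: no output, so cron has nothing to mail

        # Failure: surface everything so the email is actually worth reading.
        print(f"command failed with exit code {result.returncode}: {' '.join(cmd)}",
              file=sys.stderr)
        sys.stderr.write(result.stdout)
        sys.stderr.write(result.stderr)
        return result.returncode

    if __name__ == "__main__":
        sys.exit(main())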
The monitoring servers monitor each other (and themselves), and they should be in different data centers. You can also use a third-party service to monitor parts of your infrastructure, including the monitoring server. Depending on your needs, a simple service like Pingdom could be used.
If you are wondering how to monitor whether both data centers go down at the same time, I'd say for most companies you don't worry about it. 1) The odds are extremely low. 2) You will notice if two DCs go down. 3) That nightly email that says "I'm up" isn't going to help here. 4) Even the free version of Pingdom will alert you when your whole datacenter is down.
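To put a rough number on point 1 (assuming the two data centers fail independently, which is itself an optimistic assumption):

    # If each DC is up 99.9% of the time and failures are independent,
    # the overlap where both are down at once is tiny.

    per_dc_downtime = 0.001                  # 99.9% availability each
    both_down = per_dc_downtime ** 2         # 0.000001

    minutes_per_year = 365 * 24 * 60
    print(f"Expected overlap: about {both_down * minutes_per_year:.1f} minutes per year")
    # Roughly half a minute a year -- and you'd notice anyway.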
There are all sorts of other things to consider with redundant monitoring, but that's the job of a sysadmin -- identifying failure points, assessing risk, etc.
I'm not sure how email would help. I guess you mean the system would email you once a day saying that the monitoring system is working, and if you don't see the email you know to check into it. The monitoring system I use has an SLA tighter than 24 hours, so a daily heartbeat email wouldn't catch problems fast enough.
Usually folks divide the monitoring work between two servers, and each server monitors the other. Or you "meta monitor": a monitoring system that just monitors the monitoring system. Then you get a third party to monitor that. Then it is turtles all the way down.
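A bare-bones sketch of the "each server monitors the other" setup; the hostname, health endpoint, and page() function are placeholders, not any particular monitoring product's API.

    # Minimal cross-check: each monitoring box runs this against its peer and
    # escalates through an independent channel if the peer stops answering.

    import urllib.request
    import urllib.error

    PEER_HEALTH_URL = "http://monitor-b.example.com/health"   # hypothetical peer

    def peer_is_healthy(url, timeout=5):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except (urllib.error.URLError, OSError):
            return False

    def page(message):
        # Placeholder: in practice this goes out via a channel that does not
        # depend on the peer (SMS gateway, third-party service, etc.).
        print("PAGE:", message)

    if __name__ == "__main__":
        if not peer_is_healthy(PEER_HEALTH_URL):
            page("monitor-b is not answering its health check")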
It sounds like both of you are talking about cloud stuff when thinking about datacenters. I can see how that many layers and cross-checks would be necessary when all you really control is running memory and some pieces of storage, but a lot of that is due to the platform. When you control the actual metal, third-party monitoring services are much less necessary.
For a real DC, when it goes down I get a phone call from a human. I don't have to reinvent that process. If it's my own server room, I use a landline and a modem for OOB "dude you gotta come down here" notifications, a WAV of Woody Woodpecker or something. If the phone lines are down, I look at the newspaper headlines to see what happened.
There's no reason not to set up a standalone monitoring regime. Whether or not you use heartbeat notifications to tell you all is well is a matter of taste, but there is definitely more to maintaining your nines on a daily basis than simply adding more layers of monitoring.