More than anything else, this describes an appalling failure at every level of the company's technical infrastructure to ensure even a basic degree of engineering rigor and fault tolerance. It's noble of the author to quit, but it's not his fault. I cannot believe they would have the gall to point the blame at a junior developer. You should expect humans to fail: humans are fallible. That's why you automate.
More than that, it's telling that the company threw him under the bus when it happened. I've been through major fuckups before, and in all cases the team presents a united front - the company fucked up, not an individual.
Which is, if you think about it, true, given the series of events leading up to the disaster (the lack of a testing environment, working with prod databases, the lack of safeties in the tools used to connect to the database, etc.).
The correct way to respond to disasters like this is "we fucked up", not "someone fucked up".
When things like this happen, you have to realize there is more than one 'truth'.
There is the truth that someone truncated the users table and, because of that, caused the company great harm.
Here's another 'truth'.
1. The company lacked backups
2. The junior developer was on a production database.
Note: I'm from the oldschool of sysops who feel that you don't give every employee the keys to your kingdom. Not every employee needs to access every server, nor do they need the passwords for such.
3. Was there a process change? I doubt it; likely they made the employee feel like a failure every day and reminded him of how he could be fired at any moment. So he did the only thing he could do to escape: quit!
Horrible and wrong; if there were a good ethics lawyer around, he'd say it smells ripe for a lawsuit.
... That said, that lawyer and lawsuit won't fix anything.
The problem isn't who has the keys, it's how they're used. I don't care as much if a junior developer has the prod password; I care more about building an engineering and ops team that understands that dicking around with the prod database isn't okay. Sysops and DBAs are fallible too--I've seen a lot of old school shops that relied heavily on manual migration and configuration. Automate, test, isolate and expect failure!
And, most importantly, shake off the "agile" ethos when doing DB migrations. Just forget they exist and triple-check every character you type into the console.
What, exactly, do you think the "agile" ethos is?
We are agile, but that does _not_ mean that we don't triple-check everything we type into production consoles.
It does, however, mean that everything we do in production has been done before in at least one test environment. Why would we want to forget about doing that?
The fact that the cheap ass idiots at the company had cancelled their backup protection at Rackspace, and lacked any other form of backup is just complete incompetence.
If it hadn't been a junior developer whom nobody noticed or cared was using the prod DB for dev work, it would have been an outright DB failure sooner or later. DBs fail, and if you don't have backups you are not doing your damn job.
The CEO should be ashamed of himself, but the lead engineers and the people at the company who were nasty to this guy should all be even more ashamed of themselves.
I have to chime in and completely agree. Very lucky. Most people who survive for years at companies have learned to either stay out of sight, or navigate the Treacherous Waters of Blame whenever things go wrong.
This is actually one of the things most employees who have never been managers don't understand.
Your comment makes me think. Are you implying that this is a good practice?
I mean, in fact I do something similar. At our company a lot of stuff goes wrong too. Somehow it surprises me that there hasn't been a major fuckup yet. But I do realize that I need to watch out at all times so that blame never concentrates on me.
It is so easy to blame individuals, it just suffices to have participated somehow in a task that fucked up. Given that all other participants keep a low profile, one needs to learn how to defend/attack in times of blame.
I absolutely do NOT think it is a good practice. I think it is what lazy companies full of people afraid to lose their jobs do. I think it's most companies.
The reality is that fear is a greater motivator than any other emotion - greater than anger, sorrow, or happiness. So companies create cultures of fear, which results in productivity (at least a baseline, 'do what I need to do or not get fired' productivity), but little innovation, and often at the expense of growth.
Plus, it's just hell. You want to do great things, but know you are stepping into the abyss every time you try.
You (and the other commenters with similar strategies) are wasting productive years of your life at jobs like these. You should go on a serious job hunt for a new position, and leave these toxic wastelands before they permanently affect your ability to work in a good environment.
I wish I could say I have some sort of prescience about bad working environments. In reality I'm just good at getting out while the gettin' is still good.
Apparently their logo is ";-)". At first I thought they just had this annoying tendency to overuse the wink emoticon and found it a bit creepy. Then I saw it all over their menus and found it a bit creepy. Then I realized they appeared to be using it as their logo and I found it creepy.
When you think about it it's almost a logical certainty that he'd take the fall. Any company collectively able to understand that the actual failure was inadequate safeguards would have been able to see it coming and presumably would have prevented it from happening. If you're so inexperienced that you expect no one will ever make a mistake you'll obviously assume that the only problem was that someone made one.
It's somewhat fascinating to hear that a company so damaged managed to build a product that actually had users. And I thought my impression of the social gaming segment couldn't go any lower.
Indeed. The news here is not that a junior developer made an error with SQL db (everyone makes them, eventually), nor that the company did not have proper safeguards against problems (this happens far too often). The news is that despite such basic management incompetence, the company had been able to get a large number of paying customers.
Just think, the company spent millions of dollars training this guy in protecting production data, and then instead of treating him like gold, and putting him in charge of fixing all the weaknesses that allowed this to happen, they pushed him out of the company. Stupidity beyond measure.
Not surprised that this was a gaming company because lack of teamwork seems to be endemic in that industry.
The correct way to respond to disasters like this is "we fucked up", not "someone fucked up".
Can't agree more! A company that starts playing the blame game - and even communicates this to the outside - a) looks unprofessional, b) is unprofessional, and c) poisons the corporate culture.
It's a fail on every possible level, the technical part being only minor.
Agreed. How could a company with "millions in revenue" not back up critical databases? They were exposed not only to the threat of human error, but also to hardware failures, hackers, etc. When he submitted his resignation, the company should have encouraged him to stay. Instead, anyone at the company who had anything to do with the failure to implement regular database backups and redundant databases should have been fired.
I've had more than my fair share of failures in the start up world. It always drives me crazy to see internet companies that have absolutely no technical ability succeed, while competing services that are far superior technically never get any traction.
It's the difference between understanding the incident (programmer makes mistake) and the root cause (failing at data integrity 101). Not just no backups, but no audit log of what's happened in the game - I'd expect an MMO to have an append-only event history quite apart from their state information.
At least all transactions for purchased game items should have been logged in a separate database. There's nothing worse for a digital content company than forgetting who bought what.
Spending $5k on a UPS is very different from not having backups of your production database which runs your multimillion dollar business. This story just doesn't add up.
You misunderstood the comment. Angersock is saying that the networking equipment cost more than $5k and the friend was unwilling to buy a UPS to protect the equipment.
Sounds completely possible to me - I have worked in a lot of environments and at some places I've seen decisions that make cancelling backups look like an act of genius.
One interesting observation I can make: no correlation between excellence in operations and commercial success!
I've seen something like this. It could be a situation where they were transitioning from that backup service to their own server or another service, and the move was never completed due to some hiccup or priority change.
But if no one is watching to ensure the move was finished (or they got distracted), then something many people treat as set-it-and-forget-it could easily get into that state.
If you are transitioning from one backup service to another, shouldn't you only cancel your old service after you have successfully set up and tested the new one?
I don't think it is that far-fetched. I've had experience with a well known company that does millions in revenue per week (web based shopping cart) that just FTP's everything with no DVCS. Designers, developers, managers all have access to the server and db.
It's not just noble. Considering how everyone treated him, and the company's attitude in general, he had no future there, and neither should anyone else.
This really needs to be more of a standard thing. I've been near (but as an engineer, never responsible for) production systems my whole career. None of these systems were as terribly maintained as the one in the linked article. Production data was isolated. Backups were done regularly. Systems were provisioned with fault tolerance in mind.
Not once have I seen a full backup restore tested. Not once have I seen a network failure simulated (though I've seen several system failures due to "kicking out a cable", which sort of acts as a proxy for that technique). On multiple occasions I've seen systems taken down by single points of failure[1] that weren't foreseen, but probably could have been.
[1] My favorite: the whole closet went down once because everything was plugged into a single, very expensive, giant UPS that went poof. $40/system for consumer batteries from Office Depot would have been a much better bet. And the best part? Once the customer service engineer replaced whatever doodad failed and brought the thing back up? They plugged everything right back into it.
I'll never forget when my boss was showing the girl scouts (literally) our very expensive UPS room. He explained how even if the power goes off we'll switch to batteries then switch over to generator power. See watch, he says - then flicked the switch. Fooomm... our entire office goes dark.
This took down news information for a good chunk of Wellington finance for about half a day. (Fortunately Wellington, NZ is a tiny corner of the finance world).
Hilarious! But I admit I was super glad it was the boss playing chaos monkey, not me.
Back when I worked for a small ISP, we had a diesel generator in case the power went out longer than our UPS batteries would last. This provided a great sense of security until we decided to test the system by powering off the main breaker and... it didn't start!
It turns out the emergency stop button was pushed in. Easy enough for us to fix then, but if the power had gone out at 4am it would have been quite another matter.
After that incident, we turned off the main breaker to the building weekly. It was great fun, as most of our offices were in the same building. We had complaints for the first couple of months until everyone got used to it and had installed mini UPS's for their office equipment.
We did actually have to use the generator for real a while later. Someone had driven their car into the local power substation, and it was at least a month until it was fixed. Electricity was restored through re-routing fairly quickly, but until the substation was repaired we were getting a reduced voltage that caused the UPSs to slowly drain...
The last time they tested the diesel generator failover at a customer's site, the generator went on just fine, but then it did not want to switch to mains again. The whole building was powered by the generator for almost two days, until they managed to convince the generator to switch.
> Not once have I seen a network failure simulated.
Reminds me of the webserver UPS setup at a previous company.
The router (for the incoming T1) and the webserver were plugged in to the UPS.
UPS connected (via serial port) to webserver. Stuff running on webserver to poll whether UPS running from mains power or batteries and send panic emails if on batteries (for more than 60 seconds) and eventually shutdown the webserver cleanly if UPS power dropped below 25%.
Thing not plugged in to UPS: DMZ Network switch (that provided the connectivity between webserver and router).
Doing that kind of testing is hard. It costs time and effort. If you want to see it done on a truly awe-inspiring scale (whole data centers being taken down by zombies ;)): http://queue.acm.org/detail.cfm?id=2371516
Doing this kind of testing in a gold-plated, heavily-engineered way is hard. But that's not an excuse for not doing it at all. Just walking into your closet and pulling a cable gets you 80-95% of the testing you need, and is free. Setting up a sandbox and "restoring" a backup onto it and then doing some quick queries is likewise easy to do and eliminates huge chunks of the failure space of "bad backups".
Really, this attitude (that things have to be done right) is part of the problem here. To a seasoned IT wonk, the only alternative to doing something "The Right Way" is not doing it at all. And that's a killer in situations like these.
Don't hack your systems to make them work. Absolutely do hack at them to test.
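To make the sandbox-restore idea above concrete, something this small already covers a lot of ground. It's only a rough sketch - the dump path, database name, and row-count threshold are assumptions, not anything from the thread:

    #!/usr/bin/env bash
    # Restore last night's dump into a scratch database and run a sanity query.
    set -euo pipefail
    DUMP=/backups/prod-$(date -d yesterday +%F).sql.gz   # assumed naming scheme
    SANDBOX_DB=restore_test

    mysql -e "DROP DATABASE IF EXISTS ${SANDBOX_DB}; CREATE DATABASE ${SANDBOX_DB};"
    gunzip -c "$DUMP" | mysql "$SANDBOX_DB"

    # Did the important table come back, and does it have a sane number of rows?
    USERS=$(mysql -N -e "SELECT COUNT(*) FROM ${SANDBOX_DB}.users;")
    if [ "$USERS" -lt 1000 ]; then
        echo "Restore test FAILED: users table has only ${USERS} rows" >&2
        exit 1
    fi
    echo "Restore test OK: ${USERS} users restored from ${DUMP}"

Run it from cron and page someone when it exits non-zero; that alone rules out the "backups exist but don't restore" failure mode.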
"walking into your closet and pulling a cable" is not free, if your planned disaster recovery is not a seamless failover, but a process to recover data with some work and limited (nonzero) downtime/cost to business.
For example, our recovery plan for a financial mainframe in case of most major disasters was to restore the daily backup to off-site hardware identical to production hw; however, the (expensive) hardware wasn't "empty" but used as an acceptance test environment.
Doing a full test of the restore would be possible, but it would be a very costly disruption, taking multiple days of work for the actual environment restoration, deployment, and testing, and then all of this once more to rebuild a proper acceptance-test environment. It would also destroy a few man-months' worth of long tests-in-progress and prevent any change deployments while this is happening.
All of this would be reasonable in any real disaster, but such costs and disruptions aren't acceptable for routine testing.
"Chaos Monkey" works only if your infrastucture is built on cheap unstable and massively redundant items. You can also get excellent uptime with expensive, stable, massively controled environment with limited redundancy (100% guaranteed recovery, but not "hot failover") - but you can't afford chaos there.
To paraphrase: if you go with an awful hack job for your disaster recovery plan, testing is more expensive. And to extend: you won't actually test because it's "too expensive", and your disaster recovery plan won't work.
How is this distinct from "Don't hack your systems to make them work. Absolutely do hack at them to test."? I don't see it.
This just sounds like "my business doesn't have the financial capacity to engineer data recovery processes". Well, OK then. Just don't claim to be doing it.
We did know that we can recover backups because we did it for small parts of data, and we know that we can do disaster recovery because (a) we did test this, though very rarely; and (b) we had successfully recovered from actual full-scale disasters twice over ~7 years.
But a successful, efficient disaster recovery plan doesn't always mean "no damage" - it often means damage mitigation; i.e., we can fix this with available resources while meeting our legal obligations so that our customers don't suffer, not that there are no consequences at all. Valid data recovery plans ensure that data recovery really is possible and detail how it happens, but that recovery can be expensive. And while you can plan, document, train and test activities like "those 100 people will do X, and those 10 sales reps will call the involved customers and give them $X credit", you really don't want to put the plan into action without a damn good reason.
For example, a recovery plan for a bunch of disasters that are likely to cut all data lines from a remote branch to HQ involves documenting, printing & verifying a large pile of deal documents of the day, having them shipped physically and handled by a designated unit in the HQ. The process has been tested both as a practice and in real historical events.
However, if you "pull a wire in the closet" and cause this to happen just so, then you've just 'gifted' a lot of people a full night of emergency overtime work, and deserve a kick in the face.
All I can say is that you're very lucky to have a working system (and probably a company to work for), and I'm very lucky not to work where you do. Seriously, your "test" of a full disaster recovery was an actual disaster! More than one!
And frankly, if your response to the idea of implementing dynamic failure testing is that someone doing that should be "kicked in the face" (seriously, wtf? even the image is just evil), then shame on you. That's just way beyond "mistaken engineering practice" and well on the way to "Kafkaesque caricature of a bad IT department". Yikes.
Admittedly: you have existing constraints that make moving in the right direction expensive and painful. But rather than admit that you have a fragile system that you can't afford to engineer properly you flame on the internet against people who, quite frankly, do know how to do this properly. Stop.
I'd like not to stop, but continue exploring the viewpoints. And I'd like you and others to try and consider also less-tech solutions to tech problems if they meet the needs instead of automatically assuming that we made stupid decisions.
For example, any reasonable factory also has a disaster recovery process to handle equipment damage/downtime - some redundant gear, backup power, inventory of spare parts, guaranteed SLA's for shipping replacement, etc; But still, someone intentionally throwing a wrench in the machine isn't "dynamic failure testing" but sabotage that will result in anger from coworkers who'll have to fix this. Should their system be called "improperly engineered"?
We had great engineers implementing failover for a few 'hot' systems, but after much analysis we knowingly chose not to do it 'your way' for most of them since it wasn't actually the best choice.
I agree, in 99% of companies talked about in HN your way is undoubtedly better, and in tech startups it should be the default option. But there, much of the business process was people & phone & signed legalese, unlike any "software-first" businesses; and the tech part usually didn't do anything better than the employees could do themselves, but it simply was faster/cheaper/automated. So we chose functional manual recoveries instead of technical duplications. And you have to anyway - if your HQ burns down, who cares if your IT systems still work if your employees don't have planned backup office space to do their tasks? IT stuff was only about half of the whole disaster recovery problems.
In effect, all the time we had an available "redundant failover system" that was manual instead of digital. It wasn't fragile (it didn't break, ever - as I said, we tried), fully functional (customers wouldn't notice) but very expensive to run - every hour of running the 'redundant system' meant hundreds of man-hours of overtime-pay and hundreds of unhappy employees.
So, in such cases, you do scheduled disaster-testing and budget the costs of these disruptions as necessary tests - but if someone intentionally hurts his coworkers by creating random unauthorised disruptions, then it's not welcome.
The big disadvantage of this actually is not the data recovery or systems engineering, but the fact that it hurts the development culture. I left there because in such a place you can't "move fast and break things"; everyone tends to make sure that every deployment really, really doesn't break anything. So we got very good system stability, but all the testing / QA usually required at least 1-2 months for any finished feature to go live - which fits their business goals (stability & cost efficiency rather than shiny features) but demotivates developers.
My favourite way to test restores is to do them frequently to the dev server from the production backups - this keeps the dev data set up to date, and works as a handy test of the restore mechanism. Of course if you have huge amounts of data or files on production this becomes more difficult, but not impossible, to manage.
This works well, though you may need an "anonymizer" (and maybe some extra compliance testing) if your systems have PCI or HIPAA data on them. We have federal restrictions against storing certain types of data on servers outside the US. Cloud computing sounds great, but neither Amazon nor Google will guarantee the data stays within the country's borders.
Minor correction: the Chaos Monkey was Netflix's innovation. It just happened to be implemented on Amazon's cloud. It would have been just as useful if they had their own colocated servers or used a different cloud computing provider.
Apple did this before Amazon or Netflix in this regard [1], but the point needs to be made that a system needs to be tested and not just in a controlled aseptic way, because the real world isn't.
Another story supporting Chaos Monkey is what the Obama team did for their Narwhal infrastructure - they staged outages and random failures to prepare for their big day, meanwhile Romney's team, who outspent the Obama team by at least an order of magnitude, had their system fail on e-day.
I'd like to see a source for Romney outspending the Obama team "at least" 10x, because while I can speak from experience that ORCA was a gigantic piece of shit, it's not like the Obama people were struggling to pay their bills.
I don't know what metric the parent comment is referring to, but in terms of technology stack, I can fully believe that the Romney team spent more than Obama's team. Here's a post by one of the creators of the fundraising platform:
I actually had that post in my mind when writing my reply, but I assumed r00fus was referring to ORCA and Narwhal specifically.
> ... what the Obama team did for their Narwhal infrastructure - they staged outages and random failures to prepare for their big day, meanwhile Romney's team, who outspent the Obama team by at least an order of magnitude, had their system fail on e-day.
Exactly what you said. Rigor is truly the right word to use here. Cancelling your db backups is basically asking for a disaster. I'm not sure I have ever been at a job that didn't require a db backup for some reason at some time.
This was my question. Usually the "junior" folks are shown how to do things by the senior engineers. The fact they threw this guy under the bus while letting the rest of the senior guys skate is appalling.
Part of your job as a senior developer is to ensure this very scenario doesn't happen, least of all to someone on your watch.
The author makes mention of using a UI to connect to their db. If I were in a position over there, I could see myself writing a script to clear out the tables I wanted. This reduces errors, but not the risk.
And yet, the net result is the same; right-click, clear table, or ./clearTable.sh. Both are human actions, and both are fallible. What if some prankster edited clearTable.sh to do the users table instead of the raids table? What if he did it himself to test something?
Heck when you put it that way, this guy actually did them a FAVOR. He ONLY wiped out the User table. The company was able to learn the value of backups, and they had enough data left to be able to partially recover it from the remaining tables, which is much better than the worst case scenario.
Do you think the company learned the value of backups, or do you think they learned to blame junior devs for fuckups? Sounds like they learned nothing, because no attempt was made to determine the root cause.
Exactly. I'm ashamed that my first reaction reading this was to blame OP. But in the 2 min it took to read the post I had come full circle to wondering what kind of terribly run company would allow this to happen--I guess the type that hires philosophy majors straight out of college without vetting their engineering skills.
"Nobody cares about technical infrastructure. Our customers don't pay us for engineering rigor. We need to just ship!"
Of course the person saying that is likely to care about technical infrastructure when it costs them money and/or customers due to being hacked-together.
When I was a junior analyst, I once deleted the main table that contained all 70k+ users, passwords, etc. The problem was fixed in 15 minutes after the DBA was engaged to copy all the data back from the QA environment, which was synchronized every X minutes. Or we could have restored a backup from a few hours ago.
The whole company fucked this one up pretty badly. NO excuses.
Indeed. If it had been a sporadic hardware failure, they would have been exactly as screwed. The fact that they gave an overworked junior dev direct read/write access to the production database is astounding.
It's not that they gave him r/w at all that's so criminally stupid. It's that they required him to clear a table manually, using generic full-access tools, over and over.
In reality, this should have been re-factored to the dev db.
If it couldn't be, the junior dev should have been given access to the raids table alone for writes.
Lastly the developer who didn't back up this table is the MOST to blame. Money was paid for the state in that table. That means people TRUST you to keep it safe.
I count tons of people to blame here. I don't really see the junior dev as one of them.
Yeah I think him manually clearing a table over and over again was the big problem here. The amount of entropy that had to be introduced into the process to turn a routine task into millions of dollars of loss was tiny. He just needed to click in a slightly different spot on the webpage to bring up USERS instead of RAIDS.
I worked for a startup a few years back where the CEO deleted the entire 1+TB database/website when the server was hacked and being used as a spam server, because he 1. didn't know how to disable the site short of deleting it and 2. couldn't reach anyone who did know how.
The next morning he told us to restore the site from backups and fix the security hole. That's when we reminded him, again, that he had refused to pay for backup services for a site of that size.
We all ended up looking for new jobs within a couple of days.
If you have only one copy of data, especially if it is important, the chance of something happening to that copy - whether from hardware, software, or human error - is always big enough to justify a backup. No hindsight needed for that.
Thinking about this just makes my mind go numb. All someone had to do was have the idea that they should back their database up. It would be done in 5 minutes, tops!
I automated ours on S3 with email notifications in under an hour...
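For anyone wondering what "automated on S3 with email notifications" looks like in practice, here's a rough sketch of that kind of cron job - the bucket, database name, and addresses are placeholders, not anyone's actual setup:

    #!/usr/bin/env bash
    # Nightly dump, push to S3, and mail someone either way.
    # Example crontab entry: 30 3 * * * /usr/local/bin/backup-to-s3.sh
    set -uo pipefail

    DB=appdb
    STAMP=$(date +%F)
    DUMP=/tmp/${DB}-${STAMP}.sql.gz
    BUCKET=s3://example-db-backups        # assumed bucket name

    notify() { echo "$1" | mail -s "DB backup ${STAMP}: $2" ops@example.com; }

    if mysqldump --single-transaction "$DB" | gzip > "$DUMP" \
       && aws s3 cp "$DUMP" "${BUCKET}/${DB}-${STAMP}.sql.gz"; then
        notify "Uploaded $(du -h "$DUMP" | cut -f1) to ${BUCKET}" "OK"
    else
        notify "Backup or upload failed, check the server" "FAILED"
    fi
    rm -f "$DUMP"

It's not bulletproof (you still want restore tests and retention policies), but it's the hour of work that separates "we lost everything" from "we lost a day".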
Exactly. Problems with production databases are inevitable. It's just a matter of time.
The guy who should be falling on the sword, if anyone, is the person in charge of backups.
Better yet, the CEO or CTO should have made this a learning opportunity and taken the blame for the oversight + praised the team for banding together and coming up with a solution + a private chat with the OP.
Unbelievable. Even at my 3 person startup, back in 2010, with thousands in revenue, not millions, we had development environments with test databases and automated daily database snapshotting. Sure I've accidentally truncated a few tables in my time, but luckily I wasn't dumb enough to be developing on a production server.
I cannot even begin to fathom how they functioned without a working development environment for testing, let alone let their backups lapse.
The kind of table manipulations he mentions would be unspeakable in most companies. Someone changing the wrong table would be inevitable. If I were an auditor, I would rake them all over the coals.
Absolutely. Your immediate technical management sucked, and you were made the scapegoat for your management's failure. Welcome to the real world.
Don't get me wrong, you should feel bad, very bad, bad enough that you never do that again. But you shouldn't feel guilty or rethink your career.
A little bit of feeling guilty is in order; the author "didn't know that he didn't know", and I'm sure this motivated him to learn a lot more about proper engineering processes ... something I'll note aren't particularly a focus in CS degrees. Especially since I haven't come across anyone who's really dedicated to them who hadn't first gotten burned in one way or another.
There's a big difference between being told "Do X, don't do Y" and that sinking feeling you get when you realize a big problem exists, regardless of the eventual outcome.
You're forgetting the guy who didn't speak up to say "Hey, maybe we shouldn't do this in prod?".
I feel for him, but at the same time there's a point at which you have to ask if testing guns by shooting them near (but not specifically at) your coworkers is actually a good idea.
Agree with the point that humans are fallible. We should always have backups. A company with 1000s of paying customers should at least take steps to protect itself from this sort of catastrophe.
To the credit of the management, they did not fire him. He resigned. But the coworkers felt he was personally responsible. That makes for an uneasy work environment.
If you are a CEO you should be asking this question: "How many people in this company can unilaterally destroy our entire business model?"
If you are a CTO you should be asking this question: "How quickly can we recover from a perfect storm?"
They didn't ask those questions, they couldn't take responsibility, they blamed the junior developer. I think I know who the real fuckups are.
As an aside: Way back in time I caused about ten thousand companies to have to refile some pretty important government documents because I was double-decoding XML (&amp; became &). My boss actually laughed and was like "we should have caught this a long time ago"... by we he actually meant himself and support.
One day Intel's yields suddenly went to hell (that's the ratio of working die on a wafer to non-working, and is a key to profitability). And no matter how hard they tried, they could only narrow it down to the wafers being contaminated, but the wafer supplier swore up and down they were shipping good stuff, and they were. So eventually they tasked a guy to follow packages from the supplier all the way to the fab lines, and he found the problem in Intel's receiving department. Where a clerk was breaking open the sealed packages and counting out the wafers on his desk to make damned sure Intel was getting its money's worth....
His point is that you can have a Fortune 500 company, normally thought to be stable companies that won't go "poof" without ample warning, in which there are many more people than in previous kinds of companies who can very quickly kill it dead.
I physically cringed at that. Even the mail clerk should have noticed there was a 'big deal' about clean rooms and had some idea what the company he worked for did...
We are all born naked, bloody, and screaming; the only thing we know is how to work a nipple. Everything else has to be learned.
One of Toyota's mantras is "If the student has failed to learn, the teacher has failed to teach." Their point is that managers are responsible for solving issues that come from employee ignorance, not line workers.
Exactly. In my organization I know of perfectly good hardware that is either being tossed or used for non-critical applications because someone didn't follow the Incoming Inspection process correctly. It doesn't matter that they could simply be inspected now and found to be perfect, the process wasn't followed, so the product is "junk."
We may be quibbling over definitions. But do you believe that no mammal has an instinctual understanding of using nipples? Or that humans are unique among mammals in that regard?
I'm afraid I have no idea about the answer to either of your questions. I assumed the "we" in your original post meant "we humans".
I do understand that humans don't have an instinctual understanding of using nipples; human babies have a physical sucking reflex that kicks in when you put anything near their mouths. They usually quickly learn that sucking a nipple in a particular way gives out lots of yummy milk.
Where a clerk was breaking open the sealed packages and counting out the wafers on his desk to make damned sure Intel was getting its money's worth....
I find this hard to believe. At some point a person in a space suit was introducing them into a clean room; she should have noticed that the packages were not sealed.
Maybe the clerk was carefully sealing them back up again? In retrospect they probably should have had tamper-proof seals given the value attached to the wafers being delivered unopened, but then most problems are easily avoidable in retrospect.
Exactly; this happened early enough in the history of ICs that I'm sure they weren't taking such precautions, just like tamper revealing seals for over the counter drugs didn't become big until after the Chicago Tylenol murders.
There's a good chance a tamper-revealing seal would have stopped the clerk from opening the containers, and of course if they'd been broken before reaching the people at the fab lines who were supposed to open them, that would have clued Intel into the problem before any went into production and would have allowed them to quickly trace the problem back upstream.
Or, in this case, energy, ignorance, not learning enough of the big picture, and not wondering why these wafers were sealed in airtight containers in the first place.
It's the well meaning mistakes that tend to be the most dangerous since most people are of good will, or at least don't want to kill the company and lose their jobs.
The difference between working with bad bosses and good bosses really shows itself when there's a disaster going on.
A mere few months into my current job, I ran an SQL query that updated some rows without properly checking one of the subqueries. Long story short - I fucked up an entire attribute table on a production machine (client insisted they didn't need a dev clone). I literally just broke down and started panicking, sure that I'd just lost my new job and wouldn't even have a reference to run on for a new one.
After a few minutes of me freaking out, my boss just says: "You fucked up. I've fucked up before, probably worse than this. Let's go fix it." And that was it. We fixed it (made it even better than it was, as we fixed the issue I was working on in the process), and instead of feeling like an outcast who didn't belong, I learned something and became a better dev for it.
I really wish this was at the top. Everyone will fuck up at some point (even your best engineer). Whether you learn anything from fucking up really determines how bad your mistake was.
This is so true. I had to leave my first junior dev position for similar reasons as the OP, though nothing as monumental.
I was handed a legacy codebase with zero tests. I left a few small bugs in production, and got absolutely chewed out for it. It was never an issue with our processes, it was obviously an issue with the guy they hired who had 1 intro CS class and 1 rails hobby project on his resume. The lead dev never really cared that we didn't have tests, or a careful deploy process. He just got angry when things went wrong. And even gave one more dev with even less experience than I had access to the code base.
It was a mess and the only thing I was "learning" was "don't touch any code you don't have to because who knows what you might break" which is obviously a terrible way to learn as a junior dev (forget "move fast and break things" we "move slowly and work in fear!"). So I quit and moved on, it was one of the better decisions I've ever made.
As a programmer I consider myself very lucky that one of the first pieces of advice I got as a junior was from one of my senior colleagues (and a very smart guy): "one of the most valuable qualities of a good programmer is courage".
Seven and a half years later I make sure that I pass that knowledge on to my junior colleagues. I'm proud to say that just in the past 2 weeks I've said this twice to one of my younger team-mates, a recent hire, "don't be afraid to break things!"
Heh. Another way I would say that is that "discretion is the better part of valor" And, to rip-off Hitchhiker's Guide to the Galaxy, "cowardice is the better part of discretion"...in that if fear makes you judiciously check your backups and write tests, then that's not a bad thing at all.
Paranoia is often useful, but with a good environment and tools it's rarely needed. I find bad assumptions often cause the worst problems. Break things regularly and you end up with fewer assumptions about the code base / production environment, which is a very good thing.
This is the right thing to encourage, but I would just like to add: always have a backup.
"Don't be afraid to break things as long as you have a backup".
It might be a simple copy of the previous version of the code, a database copy, or even the entire application. Do not forget to back up. If everything fails, we can quickly restore the previous working version.
Every production deployment should involve blowing away the prior instance, rebuilding from scratch, and restarting the service; you are effectively doing a near-full "restore" for every deployment, which forces you to have everything fully backed up and accessible...
Any failure to maintain good business continuity practices will manifest early for a product / employee / team, which allows you to prevent larger failures...
Spoken like a man who has maintained applications but never databases.
In the world where data needs to be maintained, this is not necessarily an option. In the bank where I work, we deploy new code without taking any outage (provide a new set of stored procedures in the database, then deploy a second set of middleware, then a new set of front-end servers, test it, then begin starting new user sessions on the new system; when all old user sessions have completed the old version can be turned off). Taking down the database AT ALL would require user outages. Restoring the database from backup is VERY HARD and would take hours (LOTS of hours).
That being said, we do NOT test our disaster-recovery and restore procedures well enough.
This is correct and is the way it should be. So how come the programmers are always politically gunning for the keys to the production server cabinet, where you do have to be afraid to break things?
Are they? Where I work (a bank) we were more than happy to move to read-only access to prod servers, and to go through a support team when we need to deploy things.
"Move slowly and work in fear" businesses usually collapse. The technical problem eventually becomes a business problem, and no one realizes it can be solved at the source level.
It first builds inefficiency at the technical level, then at the business level; finally it causes issues at the cultural level, and that's when the smart people start leaving.
>If you are a CEO you should be asking this question: "How many people in this company can unilaterally destroy our entire business model?"
This is a question that the person in charge of backups needs to think about, too. I mean, rephrase it as "Is there any one person who can write to both production and backup copies of critical data?" but it means the same thing as what you said.
(And if the CTO, or whoever is in charge of backups, screws up this question? The 'perfect storm' means "all your data is gone" - dunno about you, but my plan for that involves bankruptcy court, and a whole lot of personal shame. Someone coming in and stealing all the hardware? Not nearly as big of a deal, as long as I've still got the data. My own 'backup' house is not in order, well, for lots of reasons, mostly having to do with performance, so I live with this low-level fear every day.)
Seriously, think, for a moment. There's at least one kid with root on production /and/ access to the backups, right? At most small companies, that is all your 'root-level' sysadmins.
That's bad. What if his (or her) account credentials get compromised? (or what if they go rogue? it happens. Not often, and usually when it does it's "But this is really best for the company" It's pretty rare that a SysAdmin actively and directly attempts to destroy a company.)
(SysAdmins going fully rogue is pretty rare, but I think it's still a good thought experiment. If there is no way for the user to destroy something when they are actively hostile, you /know/ they can't destroy it by accident. It's the only way to be sure.)
The point of backups, primarily, is to cover your ass when someone screws up. (RAID, on the other hand, is primarily to cover your ass when hardware fails.) RAID is not Backup and Backup is not RAID. You need to keep this in mind when designing your backup, and when designing your RAID.
(Yes, backup is also nice when the hardware failure gets so bad that RAID can't save you; but you know what? that's pretty goddamn rare, compared to 'someone fucked up.')
I mean, the worst case backup system would be a system that remotely writes all local data off site, without keeping snapshots or some way of reverting. That's not a backup at all; that's a RAID.
The best case backup is some sort of remote backup where you physically can't overwrite the goddamn thing for X days. Traditionally, this is done with off-site tape. I (or rather, your junior sysadmin monkey) writes the backup to tape, then tests the tape, then gives the tape to the iron mountain truck to stick in a safe. (if your company has money; if not, the safe is under the owner's bed.)
I think that with modern snapshots, it would be interesting to create a 'cloud backup' service where you have a 'do not allow overwrite before date X' parameter, and it wouldn't be that hard to implement, but I don't know of anyone that does it. The hard part about doing it in house is that the person who manages the backup server couldn't have root on production and vice versa, or you defeat the point, so this is one case where outsourcing is very likely to be better than anything you could do yourself.
> If there is no way for the user to destroy something when they are actively hostile, you /know/ they can't destroy it by accident.
Which also means they can't fix something in case of a catastrophic event. "Recover a file deleted from ext3? Fix a borked NTFS partition? Salvage a crashed MySQL table? Sorry boss, no can do - my admin powers have been neutered so that I don't break something 'by accident, wink wink nudge nudge'." This is, ultimately, an issue of trust, not of artificial technical limitations.
> one case where outsourcing is very likely to be better than anything you could do yourself.
Hm. Your idea that "cloud is actually pixie dust magically solving all problems" seems to fail your very own test. Is there a way to prevent the outsourced admins from, um, destroying something when they are actively hostile? Nope, you've only added a layer of indirection.
>> If there is no way for the user to destroy something when they are actively hostile, you /know/ they can't destroy it by accident.
>Which also means they can't fix something in case of a catastrophic event. "Recover a file deleted from ext3? Fix a borked NTFS partition? Salvage a crashed MySQL table? Sorry boss, no can do - my admin powers have been neutered so that I don't break something 'by accident, wink wink nudge nudge'." This is, ultimately, an issue of trust, not of artificial technical limitations.
All of the problems you describe can be solved by spare hardware and read-only access to the backups. I mean, your SysAdmin needs control over the production environment, right, to do his or her job? But a sysadmin can function just fine without being able to overwrite backups (assuming there is someone else around to admin the backup server).
fixing my spelling now.
Yes, it's about trust. But anyone who demands absolute trust is, well, at the very least an overconfident asshole. I mean, in a properly designed backup system (and I don't have anything at all like this at the moment) I would not have write-access to the backups, and I'm the majority shareholder and lead sysadmin.
That's what I'm saying... backups are primarily there when someone screwed it up... in other words, when someone was trusted (or trusted themselves) too much.
Okay, now I think I understand you, and it seems we're actually in agreement - there is still absolute power, but it's not all concentrated in one user :)
>Hm. Your idea that "cloud is actually pixie dust magically solving all problems" seems to fail your very own test. Is there a way to prevent the outsourced admins from, um, destroying something when they are actively hostile? Nope, you've only added a layer of indirection.
The idea here is to make sure that the people with write-access to production don't have write-access to the backups and vice versa. The point is that now two people have to screw up before I lose data.
Outsourcing has its place. You are an idiot if you outsource production and backups to the same people, though. This is why I think "the cloud" is a bad way of thinking about it. Linode and Rackspace are completely different companies... one of them screwing up is not going to affect the other.
>> I think that with modern snapshots, it would be interesting to create a 'cloud backup' service where you have a 'do not allow overwrite before date X' parameter, and it wouldn't be that hard to implement, but I don't know of anyone that does it.
I test backups for F500 companies on a daily basis (IT Risk Consulting) - this would be missing the point really, the business process around this problem is really moving towards live mirrored replication. This allows much faster recall time, and also mitigates many risks with the conventional 'snapshot' method through either tapes, cloud, etc.
I've said it elsewhere, but it bears repeating: RAID is about availability first and foremost. The fact that it happens to preserve your data in the case of one form of hardware failure is a side effect of its primary goal.
This is certainly a monumental fuckup, but these things inevitably happen even with better development practices. This is why you need backups, preferably daily, and as much separation of concerns and responsibilities as humanly possible.
Anecdote:
I am working for a company that does some data analysis for marketers, aggregated from a vast number of sources. There was a giant legacy MyISAM (this becomes important later) table with lots of imported data. One day, I made a trivial-looking migration (added a flag column to that table). I tested it locally and rolled it out to the staging server. Everything seemed A-OK until we started the migration on the production server. Suddenly, everything broke. By everything, I mean EVERYTHING: our web application showed massive 500s, total DEFCON 1 across the whole company. It turned out we had run out of disk space, since MyISAM tables are apparently altered the following way: first the table is created with the updated schema, then it is populated with data from the old table. MySQL ran out of disk space and somehow corrupted the existing tables; the server would start with blank tables, with all data lost.
I can confirm this very feeling: "The implications of what I'd just done didn't immediately hit me. I first had a truly out-of-body experience, seeming to hover above the darkened room of hackers, each hunched over glowing terminals." Also, I distinctly remember how I shivered and my hands shook. It felt like my body temperature fell by several degrees.
Fortunately for me, there was a daily backup routine in place. Still, several hour long outage and lots of apologies to angry clients.
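One cheap guard against exactly this failure mode: since rebuilding the table needs roughly its own size in free space in the datadir, check before you start. This is only a sketch with made-up database/table names, assuming MySQL:

    #!/usr/bin/env bash
    # Compare the size of the table to be ALTERed against free space in the datadir.
    set -euo pipefail
    DB=analytics
    TABLE=imported_events        # the big legacy table (hypothetical name)

    DATADIR=$(mysql -N -e "SELECT @@datadir;")
    TABLE_BYTES=$(mysql -N -e "SELECT data_length + index_length
                               FROM information_schema.tables
                               WHERE table_schema='${DB}' AND table_name='${TABLE}';")
    FREE_BYTES=$(df --output=avail -B1 "$DATADIR" | tail -1)

    if [ "$FREE_BYTES" -lt "$((TABLE_BYTES * 2))" ]; then
        echo "Not enough free space to rebuild ${DB}.${TABLE} safely; aborting migration" >&2
        exit 1
    fi
    echo "OK: table is ${TABLE_BYTES} bytes, ${FREE_BYTES} bytes free"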
"There are two types of people in this world, those who have lost data, and those who are going to lose data"
Reading those stories makes me realize how well thought-out the process at my work is:
We have dev databases (one of which was recently empty, nobody knows why; but that's another matter), then a staging environment, and finally production. And the database in the staging environment runs on a weaker machine than the prod database. So before any schema change goes into production, we do a time measurement in the staging environment to have a rough upper bound for how long it will take, how much disc space it uses etc.
And we have a monthly sync from prod to staging, so the staging db isn't much smaller than prod db.
And the small team of developers occasionally decides to do a restore of the prod db in the development environment.
The downside is that we can't easily keep sensitive production data from finding its way into the development environment.
When moving data from prod to other environments, consider a scrambler. E.g., replace all customer names with names generated from census data.
I try to keep the data in the same form (e.g., length, number of records, similar relationships - it looks like production data). But it's random enough that if the data ever leaks, we don't have to apologize to everybody.
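As a concrete (and entirely hypothetical) example of that kind of scrambling, run against the restored dev copy rather than prod - the database, table, and column names are made up:

    #!/usr/bin/env bash
    # Overwrite PII in the dev copy with deterministic placeholders.
    set -euo pipefail
    DEV_DB=app_dev

    mysql "$DEV_DB" -e "
      UPDATE users
         SET name  = CONCAT('Customer ', id),
             email = CONCAT('user', id, '@example.invalid'),
             phone = NULL;
      UPDATE form_submissions SET body = '[scrubbed]';"

Deterministic placeholders (rather than random ones) keep joins and uniqueness constraints intact, so the dev data still behaves like production data.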
Since your handle is perlgeek, you're already well equipped to do a streaming transformation of your SQL dump. :)
Yep. For x.com I wrote a simple cron job that sterilizes the automated database dump and sends it to the dev server. Roughly, it's like this:
-cp the dump to a new working copy
-sed out cache and tmp tables
-Replace all personal user data with placeholders. This part can be tricky, because you have to find everywhere this lives (are form submissions stored and do they have PII?)
-Some more sed to deal with actions/triggers that are linked to production's db user specifically.
-Finally, scp the sanitized dump to the dev server, where it awaits a Jenkins job to import the new dump.
The cron job happens on the production DB server itself overnight (keeping the PII exposure at the same level it is already), so we don't even have to think about it. We've got a working, sanitized database dump ready and waiting every morning, and a fresh prod-like environment built for us when we log on. It's a beautiful thing.
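For anyone who wants to copy the idea, a generic, stripped-down version of that overnight job might look like this (paths, table names, and hosts are placeholders, not the actual x.com setup):

    #!/usr/bin/env bash
    # Copy the existing nightly dump, strip junk tables, neutralize prod DEFINERs,
    # then ship the sanitized dump to the dev box for Jenkins to import.
    set -euo pipefail
    SRC=/backups/nightly.sql
    WORK=/tmp/sanitized.sql

    cp "$SRC" "$WORK"

    # Drop bulky cache/tmp table contents from the dump.
    sed -i '/^INSERT INTO `cache/d; /^INSERT INTO `tmp_/d' "$WORK"

    # Point triggers/views at a generic user instead of prod's db account.
    sed -i 's/DEFINER=`produser`@`%`/DEFINER=`devuser`@`%`/g' "$WORK"

    # PII placeholder substitution would go here (or after import on the dev side).

    scp "$WORK" dev-server:/var/dumps/latest.sql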
More than once (over the last few years) I've had to do some important update. I tend to do it the same way:
START TRANSACTION;
--run SQL--
--check results--
COMMIT; or ROLLBACK;
Of course, if you happen to run into the 1 or 2 MyISAM tables that no one knew were MyISAM, the rollback doesn't do anything. You've screwed up the data and need a backup.
So you always have to make a backup and check that the tables are defined the way they should be. Nothing is quite as much fun as the first time you delete a bunch of critical data off a production system. Luckily the table was small, basically static, and we had backups, so it was only a problem for ~5 minutes.
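A quick way to find the "1 or 2 MyISAM tables that no one knew were MyISAM" before trusting START TRANSACTION - just a sketch, run against whatever schemas you care about:

    #!/usr/bin/env bash
    # List every non-InnoDB table outside the system schemas.
    mysql -N -e "
      SELECT table_schema, table_name, engine
      FROM information_schema.tables
      WHERE engine <> 'InnoDB'
        AND table_type = 'BASE TABLE'
        AND table_schema NOT IN ('mysql', 'information_schema', 'performance_schema');"

If that prints anything, your ROLLBACK is not the safety net you think it is for those tables.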
Reminds me of a different story from back in the day when I worked around a pretty huge SAP system (as some kind of super user, whatever). One seemingly trivial update (trivial compared to the complete system upgrade from one version to the next, which worked without problems) cleansed the database, including all purchase orders from the last two days, company-wide. Ah, and the backup became "unusable", too.
But as far as I know, nobody was fired for this. Because, yes, things like this can just happen. And eventually it got fixed anyway.
Tens of thousands of paying customers and no backups?
No staging environment (from which ad-hoc backups could have been restored)!?!?
No regular testing of backups to ensure they work?
No local backups on dev machines?!?
Using a GUI tool for db management on the live db?!?!?
No migrations!?!?!
Junior devs (or any devs) testing changes on the live db and wiping tables?!?!?!
What an astonishing failure of process. The higher-ups are definitely far more responsible than some junior developer for this; he shouldn't have been allowed near the live database in the first place until he was ready to take changes live, and then only via a staging environment, using migrations of some kind which could then be replayed on live.
They need one of these to start with, then some process:
My hypothesis is that it's a game company and all of the focus was on the game code. The lowly job of maintaining the state server was punted off to the "junior dev" just out of school. Nobody was paying attention. It was something that just ran.
They paid the price of ignoring what was actually the most critical part of their business.
I disagree slightly. If you're a game company, your most critical part of the business is the game.
Even if you have a rock-solid database management, backup, auditing etc process, if your game is not playable, you won't have any data that you could lose by having a DB admin mis-click.
Still, not handling your next-most-critical data properly is monumentally stupid and a collective failure of everyone who should have known.
The development environment should not be able to make a direct connection to production. GitHub temporarily deleted their whole prod database due to a config screwup, because the dev server could talk to the production db. https://github.com/blog/744-today-s-outage
I wasn't suggesting it should, ideally they'd be on completely isolated machines, and there's no reason it has to connect to production. Just because you use production backups to set up your dev environment, doesn't mean the dev environment should be able to talk to production servers, quite the opposite.
What I'd normally do is have a production server, which has daily backups, copies of which are used for dev on local machines, and then pushed to a dev server with separate dev db which is wiped periodically with that production data (a useful test of restoring backups), and has no connection with the production server or db.
Can't work out why they would possibly be doing development on a live db like this, that's insanity.
I worked for the largest cellphone carrier in my country. I had write permission for the production db (not to all views, though) from my second week there onwards; the first week I used the credentials of the guy training me. The guy training me knew the whole thing was wrong, mainly because he had once run a query that froze the db for half a day. I was not a developer; I was working in the help desk.
> Using a GUI tool for db management on the live db?!?!?
I still use the mysql CLI and have for 10 years plus-or-minus, but I actually use Sequel Pro a lot. If I'm perusing tables with millions of rows, or I want to quickly see a schema, or whatever, it's been a net gain in productivity.
Think of your database in terms of code version control. You want to track the changes that are made to it, adding a column could be a migration, renaming a column could be another.
A migration enables you to track the changes you made and possibly rollback to previous migrations (database states) if ever required.
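If you're not using a framework that provides this, the core idea is small enough to sketch: numbered SQL files applied in order, with a table recording what has already been run. The names here are made up, and any real migration tool does this more robustly:

    #!/usr/bin/env bash
    # Apply pending migrations from ./migrations/, e.g. 0002_add_flag_column.sql
    set -euo pipefail
    DB=appdb

    mysql "$DB" -e "CREATE TABLE IF NOT EXISTS schema_migrations (version VARCHAR(64) PRIMARY KEY);"

    for f in migrations/*.sql; do
        v=$(basename "$f" .sql)
        applied=$(mysql -N "$DB" -e "SELECT COUNT(*) FROM schema_migrations WHERE version='${v}';")
        if [ "$applied" -eq 0 ]; then
            echo "Applying ${v}"
            mysql "$DB" < "$f"
            mysql "$DB" -e "INSERT INTO schema_migrations (version) VALUES ('${v}');"
        fi
    done

The payoff is that the same set of changes gets replayed identically on dev, staging, and production, instead of someone hand-editing tables in each environment.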
Also, I can't tell you how many times I've attempted to click on a button on a web form, and it was still loading and the button moved (along with a different button appearing in its place).
Not really. You obviously haven't (yet) done anything like
update t1 set status=0; where status=4;
when you wanted to release (set status to 0) objects that are stuck in state 4, and let all other objects keep their existing statuses.
This is an easy mistake to make on the command line. I hate GUIs too, but not having one doesn't really help when your fundamental operating model is wrong.
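One habit that catches exactly this class of mistake, assuming an engine with transactions (e.g. InnoDB): run the statement inside a transaction, sanity-check the affected row count, and only then commit.

    START TRANSACTION;
    UPDATE t1 SET status = 0 WHERE status = 4;
    -- the client reports "N rows affected"; if N looks like the whole
    -- table rather than the handful you expected, back out with ROLLBACK
    COMMIT;

It won't save you from a DROP TABLE (DDL in MySQL commits implicitly), but it does catch the misplaced semicolon and the forgotten WHERE.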
When I was 18 I took out half my town's power for 30 minutes with a bad SCADA command. It was my summer job before college, and I went from cleaning the warehouse to programming the main SCADA control system in a couple of weeks.
Alarms went off, people came running in freaking out, trucks started rolling out to survey the damage, hospitals started calling about people on life support and how the backup generators were not operational, old people started calling about how they require AC to stay alive and should they take their loved ones on machines to the hospital soon.
My boss was pretty chill about it. "Now you know not to do that" were his words of wisdom, and I continued programming the system for the next 4 summers with no real mistakes.
I'm interested in knowing some more details regarding the architectural setup and organizational structure that would allow something like this to happen.
It was the small New England town I grew up in outside of Boston. Typically they manage their own power distribution, and purchase power from a utility at a rate fixed by the highest peak usage of the year, which tends to be 1:00 sometime in early August.
I was setting up a system to detect the peak usage and alert the attendant to send a fax (1992) to our local college generator plant to switch over some of their power to us, thus reducing our yearly power bill by a lot.
There was a title grabber in charge of the department, and a highly competent engineer running everything and keeping the crews going. That was my boss. Each summer I was doing new and crazy things, from house calls for bad TV signals caused by grounding to mapping out the entire grid. Oddly enough there was no documentation as to what house was on what circuit.. it was in the heads of the old linesmen. Most of the time when they got a call about power outages they drove around looking at which houses were dark and which were not.
Sometimes the older linesmen would call me on the radio to meet them somewhere, and we'd end up in their back yards taking a break having some beers. I learned a lot from those old curmudgeonly linesmen. They made fun of me non stop, but always with a wink and roaring laughter. Especially when I cut power to half the town.
It's a hard space to break into. The businesses are conservative about any change, and new vendors are fairly un-trusted. Further, it is a bit of a different world from normal software realms, due to crazy long legacy lifetimes. Another problem is that the money isn't as big as you'd think; it's a surprisingly small field.
All that being said, there is opportunity, just not easy opportunity. And a huge number of the people in it are boomers, so there are going to be big shake-ups in the next decade or two.
Whoever was your boss should have taken responsibility. Someone gave you access to the production database instead of setting up a proper development and testing environment. For a company doing "millions" in revenues, it's odd that they wouldn't think of getting someone with a tiny bit of experience to manage the development team.
We sell middleware to a number of customers with millions of dollars in revenue who don't have backups, don't have testbeds for rolling out to "dev" before pushing to "prod" and don't have someone with any expertise in managing their IT / infrastructure needs.
My experience is that this is the norm, not the exception.
As a college student, this fucking horrifies me. Is there any way I can guarantee I don't end up at someplace as unprofessional as this? I want to learn at my first job, not teach/lead.
The interview advice here is excellent. Ask questions - in the current climate they're hunting you, not the other way around.
Additionally, start networking now. Get to know ace developers in your area, and you will start hearing about top-level development shops. Go to meetups and other events where strong developers are likely to gather (or really, developers who give a shit about proper engineering) and meet people there.
It's next to impossible to know, walking into an office building, whether the company is a fucked up joke or good at what it does - people will tell you.
An interview is a two way street. You too can ask questions, make sure you fit within the team, and that the job is to your liking. That is the time to ask and find out about anything that may be pertinent to your job.
Also, you have the choice of leaving if you don't like the job or don't find the practices in place to be any good, or you can stay and fix them.
It doesn't cover every last thing, but a team following these practices is the kind of team you're looking for. Ask these questions at your interviews.
Ask about their development and staging process during the technical interview. Ask about how someone gets a piece of code into production (listen for mention of different environments).
Well you should certainly look for a team that gives you a good impression with their dev process.
But things are not always perfect, even on great teams. Not saying it's normal to destroy your production database! But even in good shops it's a constant challenge to stay organized and do great work. Look for a team that is at least trying to do great work, rather than a complacent team.
The CEO leaned across the table, got in my face, and said, "this, is a monumental fuck up. You're gonna cost us millions in revenue".
No, the CEO was at fault, as was whoever let you develop against the production database.
If the CEO had any sense, he should have put you in charge of fixing the issue and then making sure it could never happen again. Taking things further, they could have asked you to find other worrying areas, and come up with fixes for those before something else bad happens.
I have no doubt that you would have taken the task extremely seriously, and the company would have ended up in a better place.
Instead, they're down an employee, and the remaining employees know that if they make a mistake, they'll be out of the door.
I was in a situation very similar to yours. Also a game dev company, also lots of user data etc etc. We did have test/backup databases for testing, but some data was just on live and there was no way for me to build those reports other than to query the live database when the load was lower.
In any case, I did a few things to make sure I never ended up destroying any data. Creating temporary tables and then manipulating those.. reading over my scripts for hours.. dumping table backups before executing any scripts.. not executing scripts in the middle/end of the day, only mornings when I was fresh etc etc.
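To make the "dump table backups before executing any scripts" step concrete, a minimal sketch in SQL (the table name and suffix are hypothetical):

    -- a cheap, same-server snapshot before running a risky script
    CREATE TABLE users_backup_20130303 AS SELECT * FROM users;
    -- note: CREATE TABLE ... AS SELECT copies the rows but not the indexes
    -- or foreign keys, so it's a quick safety net, not a substitute for
    -- real backups; if the script goes wrong, the original rows are still
    -- there to copy back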
I didn't mess up, but I remember how incredibly nerve-wracking that was, and I can relate to the massive amount of responsibility it places on a "junior" programmer. It just should never be done. Like others have said, you should never have been in that position. Yes, it was your fault, but this kind of responsibility should never have been placed on you (or anyone, really). Backing up all critical data (what kind of company doesn't back up its users table?! What if there had been hard disk corruption?), and being able to restore in minimum time, should have been dealt with by someone above your pay grade.
It was a bunch of different tasks. For some, we did use a read only account. Other tasks (updating top 10 scores, updating the users table with their geo-ip based location etc) required write access.
Just to add some more thoughts based on other comments.. yes a lot of companies do stuff like this, particularly startups. The upside in these situations is that you end up learning things extremely quickly which wouldn't be possible in a more controlled environment. However not having backup and restore working is just ridiculous and I keep shaking my head at how they blamed the OP for this mistake. Unbelievable.
If it helps explain things, the only experience the CEO had before this social game shop was running a literal one-man yogurt shop.
This happened a week before I started as a Senior Software Engineer. I remember getting pulled into a meeting where several managers who knew nothing about technology were desperately trying to place blame, figure out how to avoid this in the future, and so on.
"There should have been automated backups. That's really the only thing inexcusable here.", I said.
The "producer" (no experience, is now a director of operations, I think?) running the meeting said that was all well and good, but what else could we do to ensure that nobody makes this mistake again? "People are going to make mistakes", I said, "what you need to focus on is how to prevent it from sinking the company. All you need for that is backups. It's not the engineer's fault.". I was largely ignored (which eventually proved to be a pattern) and so went on about my business.
And business was dumb. I had to fix an awful lot of technical things in my time there.
When I started, only half of the client code was in version control. And it wasn't even the most recent shipped version. Where was the most recent version? On a Mac Mini that floated around the office somewhere. People did their AS3 programming in notepad or directly on the timeline. There were no automated builds, and builds were pushed from peoples' local machines -often contaminated by other stuff they were working on. Art content live on our CDN may have had source (PSD/FLA) distributed among a dozen artist machines, or else the source for it was completely lost.
That was just the technical side. The business/management side was and is actually more hilarious. I have enough stories from that place to fill a hundred posts, but you can probably get a pretty good idea by imagining a yogurt-salesman-cum-CEO, his disbarred ebay art fraudster partner, and other friends directing the efforts of senior software engineers, artists, and other game developers. It was a god damn sitcom every day. Not to mention all of the labor law violations. Post-acquisition is a whole 'nother anthology of tales of hilarious incompetence. I should write a book.
I recall having lunch with the author when he asked me "What should I do?". I told him that he should leave. In hindsight, it might have been the best advice I ever gave.
So the person who made a split-second mistake while doing his all for the business was pressured into resigning - basically, got fired.
What I want to know is what happened to whoever decided that backups were a dispensable luxury? In 2010?
There's a rule that appears in Jerry Weinberg's writings: the person responsible for an X-million-dollar mistake (and who should be fired over such a mistake) is whoever has controlling authority over X million dollars' worth of the company's activities.
A company-killing mistake should result in the firing of the CEO, not in that of the low-level employee who committed the mistake. That's what C-level responsibility means.
(I had the same thing happen to me in the late 1990's, got fired over it. Sued my employer, who opted to settle out of court for a good sum of money to me. They knew full well they had no leg to stand on.)
"We make astonishingly fun, ferociously addictive games that run on social networks.
...KlickNation boasts a team of extremely smart, interesting people who have, between them, built several startups (successful and otherwise); written a novel; directed music videos; run game fan sites; illustrated for Marvel Comics and Dynamite Entertainment with franchises like Xmen, Punisher, and Red Sonja; worked on hit games like Tony Hawk and X-Men games; performed in rock bands; worked for independent and major record lables; attended universities like Harvard, Stanford, Dartmouth, UC Berkeley; received a PhD and other fancy degrees; and built a fully-functional MAME arcade machine."
And this is hilarious: their "careers" page gives me a 404:
I am tempted to apply simply to be able to ask them about this. It would be interesting to hear if they have a different version of this story, if it is all true.
One of the things I like asking candidates is "Tell me about a time you screwed up so royally that you were sure you were getting fired."
Let's be honest, we all have one or two.. and if you don't, then your one or two are coming. It's what you learned to do differently that I care about.
And if you don't have one, you're either a) incredibly lucky, b) too new to the industry, or c) lying.
This. We all mess up, but only the best ones will deal with it professionally and learn from it. Sounds like the OP is in that group. He didn't try to hide it, blame it on anyone else, or make excuses. He just did what he could to fix his mistake.
When people say "making mistakes is unacceptable - imagine if doctors made mistakes" they ignore three facts:
1. Doctors do make mistakes. Lots of them. All the time.
2. Even an average doctor is paid an awful lot more than me.
3. Doctors have other people analysing where things can go wrong, and recommending fixes.
If you want fewer development mistakes, as a company you have to accept it will cost money and take more time. It's for a manager to decide where the optimal tradeoff exists.
> If you want fewer development mistakes, as a company you have to accept it will cost money and take more time. It's for a manager to decide where the optimal tradeoff exists.
This is absolutely it, of course it is possible to become so risk averse that you never actually succeed in getting anything done and there are certainly organisations that suffer from that (usually larger ones).
However, some people seem to take the view that because it is impossible to protect oneself from all risks, it is pointless to protect against any of them.
The good news is that protecting against risks usually gets exponentially more expensive as you add "nines"; therefore a 99% guarantee against data loss is a lot cheaper than a 99.999% guarantee.
Having a cron job that does a mysqldump of the entire database, emails some administrator and then rsyncs it to some other location (even just a Dropbox folder) is probably only a couple of hours' work.
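A minimal sketch of that sort of cron script, with made-up paths and hostnames and credentials assumed to live in ~/.my.cnf; a real setup should also periodically test that the dumps actually restore:

    #!/bin/sh
    # /etc/cron.daily/db-backup: dump, compress, ship off-box, tell a human
    set -e
    STAMP=$(date +%F)
    DIR=/var/backups/mysql
    mkdir -p "$DIR"
    DUMP="$DIR/all-databases-$STAMP.sql.gz"
    mysqldump --all-databases --single-transaction | gzip > "$DUMP"
    # copy to a different machine so a dead server doesn't take its backups with it
    rsync -a "$DUMP" backup@offsite.example.com:/srv/db-backups/
    # the daily mail doubles as a heartbeat: notice when it stops arriving
    echo "backup $STAMP ok: $(du -h "$DUMP" | cut -f1)" | mail -s "db backup $STAMP" admin@example.com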
The aviation industry is the same. Most aviation authorities around the world have a no-fault reporting system so that fixes can get implemented without pilots worrying about losing their job.
In my first year as an engineer, I shut off the production application server. It was an internally-developed app server built in & for an internally-developed language, with its own internally-developed persistence mechanism, and it was my first day working with it. I had several terminals open, and in one I'd logged on to production to check how the data was formatted, so I could get something comparable for my dev environment. Unfortunately I forgot to log off, and at some point I issued the `stop server` command, thinking I was doing it on dev. A few minutes later the project leader, one of the most senior engineers, came over and started yelling at me. Fortunately not much was lost (other than a few users' sessions--not a high-traffic site). But I was really appalled at my mistake and I certainly wondered if I was about to be fired. Since then I've taken steps to (1) keep myself off production as much as possible (2) make a production terminal visually distinct from a non-production one. As a sometime project leader myself now, I also make sure newcomers can easily obtain an adequate development environment, and I don't give out the production login to every junior dev. :-)
One of the things I like asking candidates is "Tell me about a time you screwed up so royally that you were sure you were getting fired."
I asked similar, and agree that it's a really useful question.
I think it's an especially great one for startups, as successful candidates are more likely to come into contact with production systems.
For these positions you not only want people capable of recovering from accidents, but also people who have screwed up precisely because they've been trusted with systems they could damage. Those who've never been trusted with anything they could break are unlikely to be of much use.
Part of our culture/personality/team fit questions is like this. We also have one that's something like "tell me about a time you failed a commitment (deadline, etc.) and how you handled it". Numerous people with over a decade of experience claim perfect records, frequently blaming all those around them as having failed. It's been a really easy way to eliminate candidates, particularly because almost every team member we have gave an example from within about a month of when they interviewed.
Interesting, though perhaps they misinterpreted the question. Some companies seem to have a culture where you never admit failure and perhaps they assumed that was the answer you were looking for?
All of the failed responses weren't just not admitting failure, but directly blaming others. One example was a team lead in charge of a project where they said the developers missed their deadlines and would "whine" about not having enough details to do their job. Someone who didn't have an example of failing to deliver, but had examples of near misses they were able to save would even be good examples. Playing the blame game, though, just doesn't fit.
"I found myself on the phone to Rackspace, leaning on a desk for support, listening to their engineer patiently explain that backups for this MySQL instance had been cancelled over 2 months ago."
Here's something I don't get: didn't Rackspace have their own daily backups of the production server, e.g. in case their primary facility was annihilated by a meteor (or some more mundane reason, like hard drive corruption)?
Regardless, here's a thought experiment: suppose that Rackspace did keep daily backups of every MySQL instance in their care, even if you're not paying for the backup service. Now suppose they get a frantic call from a client who's not paying for backups, asking if they have any. How much of a ridiculous markup would Rackspace need to charge to give the client access to this unpaid-for backup, in order to make the back-up-every-database policy profitable? I'm guessing this depends on 1) the frequency of frantic phone calls, 2) the average size of a database that they aren't being paid to back up, 3) the importance and irreplaceability of the data that they're handling, and 4) the irresponsibility of their major clients.
Nope, not going to happen. There's at least one good reason, and that is that if Rackspace leaks your data via a backup they're going down to the tune of millions.
Yes, it would be nice if Rackspace could speculatively create backups, but they'd be skating on thin ice doing so.
I really feel sorry for this guy. Accidents happen, which is why development happens in a sandboxed copy of the live system and why backups are essential. It simply shouldn't be possible (or at least, not that easy) for human error to put an entire company in jeopardy.
Take my own company: I've accidentally deleted /dev on development servers (not that major an issue thanks to udev, but the timing of the mistake was lousy), a co-worker recently dropped a critical table on a dev database, and we've had other engineers break Solaris by carelessly punching in chmod -R / as root (we've since revised engineers' permissions so this is no longer possible). As much as those errors are stupid, and as much as engineers of our calibre should know better, it only takes a minor lack of concentration at the wrong moment to make a major fsck-up. Which is doubly scary when you consider how many interruptions the average engineer gets a day.
So I think the real guilt belongs to the entire technical staff, as this was a cascade of minor fsck-ups that led to something catastrophic.
Last year I worked at a start-up that had manually created accounts for a few celebrities when they launched, in a gutsy and legally grey bid to improve their proposition†. While refactoring the code that handled email opt-out lists I missed a && at the end of a long conditional and failed to notice a second, otherwise unused opt-out system that dealt specifically with these users. It was there to ensure they really, really never got emailed. The result?
These mistakes are almost without fail a healthy mix of individual incompetence and organisational failure. Many things - mostly my paying better attention to functionality I rewrite, but also the company not having multiple undocumented systems for one task, or code review, or automated testing - might have saved the day.
Once, a long time ago, I spent the best part of a night writing a report for college, on an Amstrad PPC640 (http://en.wikipedia.org/wiki/PPC_512).
Once I was finished, I saved the document -- "Save" took around two minutes (which is why I rarely saved).
I had an external monitor that was sitting next to the PC; while the saving operation was under way, I decided I should move the monitor.
The power switch was on top of the machine (an unusual design). While moving the monitor I inadvertently touched this switch and turned the PC off... while it was writing the file.
The file was gone, there was no backup, no previous version, nothing.
I had moved the monitor in order to go to bed, but I didn't go to bed that night. I moved the monitor back to where it was, and spent the rest of the night recreating the report, doing frequent backups on floppy disks, with incremental version names.
Yeah; I was lucky that my first experience where I could lose data like that (before it was on punched cards) was a nice UNIX(TM) V6 system on a PDP-11/70 that had user accessible DECTAPEs. Because I found the concept interesting, I bought one tape, played around with it including backing up all my files ... and then I learned the -rf flags to rm ^_^.
That was back in the summer of 1978; today I have an LTO-4 tape drive driven by Bacula and backup the most critical stuff to rsync.net, the latter of which saved my email archive when the Joplin tornado roared next to my apartment complex and mostly took out a system I had next to my balcony sliding glass doors and the disks in another room with my BackupPC backups.
As long as we're talking about screwups, my ... favorite was typing kill % 1, not kill %1, as root, on the main system the EECS department was transitioning to (that kills the initializer "init", from which all child processes are forked). Fortunately it wasn't under really serious heavy use yet, but it was embarrassing.
This happened to me once on a much smaller scale. Forgot the "where" clause on a DELETE statement. My screwup, obviously.
We actually had a continuous internal backup plan, but when I requested a restore, the IT guy told me they were backing up everything but the databases, since "they were always in use."
(Let that sink in for a second. The IT team actually thought that was an acceptable state of affairs: "Uh, yeah! We're backing up! Yeah! Well, some things. Most things. The files that don't get like, used and stuff.")
That day was one of the lowest feelings I ever had, and that screwup "only" cost us a few thousand dollars as opposed to the millions of dollars the blog post author's mistake cost the company. I literally can't imagine how he felt.
That is pretty hilarious. I guess you can save a lot of money on tapes, if you do incremental backups only on files that never change.
Personally I felt bad when I deleted some files that were recovered within the hour, and I learned from that experience. But when you create a monumental setback like the OP did with a simple mistake, that's an issue with people at higher ranks.
I know how you felt. Many years ago when I was a junior working at a casual game company, I was to add a bunch of credits to a poker player (fake money). I forgot the WHERE clause in the SQL and added credits to every player in our database. Luckily it was an add and not a set, so I could revert it. Another time I was going to shut down my PC (a Debian box) using "shutdown -h now" and totally forgot that I was in an ssh session to our main game server. I had to call the tech support overseas and ask him to physically turn the server back on...
molly-guard -- looks great. I've always aliased my commands so if I accidentally type 'po' while in an ssh session on a production server then it complains. On my local dev box, 'po' is just quicker for me to power off the box. It has saved me at least once.
At one job I went with this scheme for terminal background color: green screen for development, blue for testing, yellow for stage / system test, and red for production. This saved a lot of problems because I knew to be very careful when typing in the red.
Interesting, I have been the only guy on my team who had the exact opposite colours. Green was production and red was testing (I didn't do any dev work). I guess that came from me coming from production. But I should have paid more attention when working with other people on my machine, in hindsight... Luckily, nothing bad ever happened!
I liked a green screen (too much time spent with old terminals), and blue was OK to type on but not as nice. I went with yellow (warning) and red (serious warning) because they are not very comfortable to type on and most people get the "alert" status, given Star Trek.
I have done this on a few servers but found that it always screws up formatting of the lines in bash when they are long and you are hitting up and going back through the history.
Did you change the $PS1 variable? Can you share your config?
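Not the parent's actual config, but a minimal sketch of the usual approach; the \[ and \] around the colour escapes are what stop bash from mangling long lines when you scroll back through history, as mentioned above:

    # ~/.bashrc on the production box: red background on the prompt as a warning
    PS1='\[\e[1;41m\]\u@\h\[\e[0m\]:\w\$ '
    # on a dev box, green instead
    PS1='\[\e[1;42m\]\u@\h\[\e[0m\]:\w\$ '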
Turning off the wrong server is a thing that bit me before I installed molly-guard. These days that, and similar tweaks, are something I apply to all hosts I control.
(molly-guard makes you type in the hostname before a halt/shutdown/reboot command.)
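The idea is small enough to sketch as a wrapper script. This is not molly-guard's actual implementation (the real package wraps halt/shutdown/reboot and is smarter about when to prompt), just the shape of it, with a hypothetical install path:

    #!/bin/sh
    # hypothetical /usr/local/sbin/reboot, found in $PATH before the real one
    printf 'You are on "%s". Type the hostname to confirm reboot: ' "$(hostname)"
    read answer
    [ "$answer" = "$(hostname)" ] || { echo "Hostname mismatch, not rebooting."; exit 1; }
    exec /sbin/reboot "$@"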
Steve Kemp? Slightly unrelated, but I just want to say I'm (still) a big fan of your old 2004 program window.exe. Very handy for unhiding the odd broken program.
Have a wonderful day! (And I'll definitely look at installing molly-guard on my production Debian servers.)
I think it's a bit extreme to say he did more good than harm. He might have done some long-term good by having the company re-examine permissions and environments, but he probably did a lot of long-term harm by alienating current and future customers.
There's an argument to be made that the company is doing harm to customers just by existing in such a precarious state. Anything that forces the technical leadership of the company to do the right thing or fail completely is actually better for customers in the long term.
Better that it happened 2 months after backups were canceled than 6 months or later. If you're going to cancel your backups you're begging for disaster.
"But but but but...that item in the expense report is HUUUGE, and what revenue did we get out of having backups lately? Or ever? I say we drop it, nothing could possibly happen."
Some experiences are non-transferable. This identical conversation has taken place millions of times, but noooo: every penny-wise-pound-foolish CEO wants to experience the real thing, apparently.
If you ever notice that your employer or client isn't backing up important data, take a tip from me: do a backup today, in your free time, and if possible (again in your free time) create the most basic regular backup system you can.
When the time comes, and someone screws up, you will seem like a god when you deliver your backup, whether it's a 3-month-old one-off, or from your crappy daily backup system.
Well, IANAL. I think you already covered the most important point: store backups on hardware/services under the control of your employer/client.
I would document the backup process and communicate it to my manager/client with a mail like "hey, I set up backups, they are stored at <server>, docs are in the wiki".
Other potential issues: causing unauthorized costs ("who stored 10TB on S3?") or privacy violations, e.g. when working with healthcare or payment data.
I've done this before and I just email it to myself using the company email account. This way nothing leaves the workplace. Also, no financial transaction data was in the db as it was a simple wordpress blog.
If it stored credit card data or other important stuff I'd take a look at what PCI compliance says you have to do for your backups and follow that.
> The CEO leaned across the table, got in my face, and said, "this, is a monumental fuck up. You're gonna cost us millions in revenue".
Yes, it is a monumental fuck-up. You put a button in front of a junior developer that can cost the company millions if he accidentally clicks it and doesn't even have undo.
Mistakes happen, and there should have been better safeguards -- backups, locking down production, management oversight.
But, I actually applaud how he tried to take responsibility for his actions and apologized. Both "junior" AND "senior" people have a hard time doing this. I've seen experienced people shrug and unapologetically go home at 6pm after doing something equivalent to this.
The unfortunate thing here seems to be that he took his own actions so personally. He made an honest mistake, and certainly there were devastating consequences, but it's important to separate the behavior from the person. I hope he realizes this in time and forgives himself.
There are several reasons why you should not feel guilty. The company was asking for trouble and you just happened to be the trigger. These are the top three things that could have prevented the incident:
1) A cron job for the manual task you were doing.
2) Not working directly on production.
3) Having daily backups
And this could have happened to anybody. After midnight any of us is at junior level and very prone to making this kind of mistake.
I cannot believe that people still don't have reliable backups in place.
My feeling is this:
If you are in any way responsible for data that is not backed up, you should be fired or resign right now. You should never work in IT, in any way, ever again. If you are the CEO of a company in a similar state, again, fire yourself right now. Vow never to run a business again. This is 2013. And guess what? You still can't buy your unique data back from PC World. Your data is "the precious".
As for the treatment of this guy, IMHO his employers were the worst kind of spineless cowards. This was 100% the fault of the management, and you know what? They know it. Not having backups is negligent, and should result in firings high up. Yet these limp cowards sought to blame this kid. Pure corporate filth of the lowest order. Even the fact that he was junior is irrelevant; anyone could have done that, more likely a cocky senior taking some shortcut. Let me tell you now, I have made a similar cock-up, and I think I know it all. But I had backups, and luckily for me, it was out of business hours. Quick restore, and the users never knew. I did fess up to my team, since I thought it had direct value as a cautionary tale.
Frankly, I am utterly amazed and gutted that such a thing can still happen. The corporate cowardice is sadly expected, but to not have backups is literally unforgivable negligence.
Yeah, I'm quite fundamentalist about data and backups. I'd almost refer to myself as a backup jihadist.
Just wondering, when consulting I usually take care that there are appropriate clauses in the contract to make me not liable. But what is the rule for employees, are they automatically insured?
In Germany there are the concepts of "Fahrlässigkeit" (negligence) and "grobe Fahrlässigkeit" (gross negligence). By law you are already liable if you are merely negligent, but it is possible to limit liability to gross negligence in the contract. That is my understanding anyway (not a lawyer). Usually I also try to kind of weasel out of it by saying the client is responsible for appropriate testing and things like that... Overall it is a huge problem, though, especially if the client has a legal department. Getting insurance is quite expensive because it's easy to create millions of dollars in damages in IT.
In court, "standard best practices" can become an issue, too. This worries me because I don't agree with all the latest fads in software development. It seems possible that in the future x% test coverage could be required by law, for example. Or even today a client could argue that I didn't adhere to standard best practices if I don't have at least 80% test coverage (or whatever; I'm not sure what a reasonable number would be).
More responsible, I would say. You expect a junior to make mistakes; the company should be structured to handle that happening.
Though I would look askance at whoever hired a philosophy grad as well, to be perfectly honest. The author admits he didn't have the experience to spot bad practice at the time.
Actually even senior developers or architects make mistakes. Philosophy grad or not, it doesn't matter. That's to be expected.
What's more questionable is:
* Developers have access to the production database from their machines, while it should only be accessible to the front-end machines within the datacenter.
* Junior developers don't need access to production machines; only sysops and maybe the technical PM do.
* No backup of the production database. WTF???
If they had a hardware failure they would have been in the same shit.
You don't just expect a junior to make mistakes, you expect everyone AND everything to make mistakes/fail. Backups and separated environments are the least you could expect from a company earning millions.
Well - to be fair - if the company's practices include developing on the production database and not doing daily/weekly backups, then hiring an inexperienced Philosophy major is the least of their problems.
Also, someone who actually had the development experience and knowledge of better-practice would not have taken that position.
Literally every engineer who didn't think to have a dev database is also responsible. This is just a case of "shit happens", since nobody had the sense not to work on the live db, or to back it up.
This. Times 1000. I would even go as far as to fire the CTO, because if the data is your bread and butter, then you protect the data.
Protecting the data is called a disaster recovery (DR) plan in those big outdated companies that people like to make fun of.
The reason that these companies have a DR plan is to tell the CEO that 'when' (not 'if') all of the data goes away, (a) how long will it take to get it back and (b) how out-of-sync will this data be (2 minutes from freshness? 6 hours?).
I did a very similar thing after one year working at my company: instead of clearing the whole user table, I replaced every user's information with my own account information.
I forgot to copy the WHERE part of the query .....
The only difference is that it was policy to manually take a backup before doing anything on production, and the data was restored in less than 10 minutes. Even if I had forgotten to make a backup manually, we had a daily complete backup and an incremental one every couple of hours.
If I were the author, I would rewrite this and reflect on what was actually wrong here. At the end of the day, you resigned out of shame for a serious incident that you triggered.
But the fact that the organization allowed you to get to that point is the issue. Forget about the engineering issues and general organizational incompetence... the human side of it is the most amazingly ridiculous part.
I respect your restraint. If I was singled out with physical intimidation by some asshat boss while getting browbeaten by some other asshat via Skype, I probably would have taken a swing at the guy.
Competent leadership would start with a 5-whys exercise. Find out why it happened, and why even the simplest controls were not implemented. I've worked in places running on a shoestring, but the people in charge were well aware of what that meant.
The CEO leaned across the table, got in my face, and said, "this, is a monumental fuck up. You're gonna cost us millions in revenue". His co-founder (remotely present via Skype) chimed in "you're lucky to still be here".
This is when you should have left. That's no way to manage a crisis.
Wow, I'm sorry you had to experience that. I'm sure it was traumatic -- or perhaps you took it better than I would have. It must be of some comfort to look back now and realize that you only bore a small part of the blame, and that ultimately a large portion of the responsibility lies on the shoulders of whoever set up the dev environment like that, as well as whoever cancelled the backups.
I would love to see some reflection on this story from OP.
What do you think you learned from this experience?
Do you think your response was appropriate?
What would you have done differently?
Are you forever afraid of Prod env?
Many, many, many of us have been in this situation before, whether as 'monumental' or not. So it is interesting to hear how others handle it.
I realize that the dev environment was a recipe for disaster, and I was simply the one to step on the mine .. but I believe my guilt about leaving the company is 'quite right'. Thankfully I'm not forever afraid of Prod env - I still do a lot of risky stuff .. but I always have nightly backups, and other 'recreate the data' strategies in place.
Guilt is a moral concept; when it comes to a run-of-the-mill operations mistake like yours, it does not belong in analysis of its consequences to the business. You are not a robot. You have made and will make mistakes this bad and worse.
Only consequentialist thinking should be the order of the day here; "what do we know that can prevent a similar mistake from hurting our bottom line". In this case backups are the standard, reasonable, well-known practice. Nothing will be improved by a firing or a resignation, by blaming or by shaming.
Insofar as the real root cause of the problem was not addressed, it's a reasonable prediction that any such company eventually joins the deadpool due to similar oversights.
Risk avoidance (decent staging) and risk mitigation (backups) are two mostly orthogonal aspects of risk management. Often, a backup will be a good first step for a totally messed up system.
However, using the claim that mistakes will always reach the outer layer to discount the value of risk avoidance is arguing from the possibility of a risk being realised, when what matters is its probability.
I'm not trying to discount the value of risk avoidance; they are both important and should both be used, but mitigation should always be the priority of the two.
1) When you have neither you should focus on risk mitigation first.
2) Having a great and complex risk avoidance policy in place is a good thing but doesn't mean that you need a lesser mitigation system.
Something similar, anyway (I was deleting rows from production and hadn't selected the WHERE clause of my query before I ran it).
It was on my VERY FIRST DAY of a new job.
Fortunately they were able to restore a several-hours-old copy from a sync to dev, but there wasn't a real plan in place for dealing with such a situation. There could just as easily not have been a recent backup.
This was in a company with 1,000 employees (dev team of 50) and millions in turnover. I've worked other places that are in such a precarious position too.
At least my boss at the time took responsibility for it - new dev (junior), first day, production db = bad idea.
"The implications of what I'd just done didn't immediately hit me. I first had a truly out-of-body experience, seeming to hover above the darkened room of hackers, each hunched over glowing terminals."
Holy crap. I know that _exact_ same feeling. I had to laugh. I know that out-of-body feeling all too well.
I worked at a small web hosting company that did probably £2m in revenue a year in my first programming job. They had me spending part of my time as support and the other part on projects.
After about 3 or so months they took me out of support and literally placed my desk next to the only full-time programmer the company had.
They made all changes directly on the live servers, and I'd already raised this as a concern; now that this became my full-time job, it was agreed that I'd be allowed to create a dev environment.
Long story short, I exported the structure of our MySQL database and imported it into dev. Some variable was wrong so it didn't all import, so I changed the variable, dropped the schema and went back to redo the import.
Yeah that was the live database I just dropped. After a horrible feeling that I can't really explain I fessed up. I dropped it during lunch so it took about two hours to get a restore.
The owner went mad, but most other people were sympathetic, telling me their own big mistakes and telling me that's what backups were for.
The owner was going crazy about losing money or something, and the COO pulled me into a room. I thought I was getting fired, but he just asked me what happened and said "yeah, we all make mistakes, that's fair enough, just try not to do it again".
I was then told to get on with it, and it must have taken me a day to finish what would have taken me an hour, but I did it, and now we had a process and a simple dev environment. I lasted another two years there. I left over money.
I used to be a freelance web developer/tech guy with one client, a designer. What made me quit was an incident where his client's WordPress site hadn't been moved properly to the new hosting (not by me).
The DB needed a search-and-replace to remove all the old URLs. After doing so, the wp_options cell on the production site holding much of the customizations kicked back to the theme defaults: the serialized data format being used was sensitive to brute DB-level changes.
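For what it's worth, the safer way to do that kind of URL swap on a WordPress database is a serialization-aware tool rather than a raw SQL REPLACE; WP-CLI's search-replace command is one example (a sketch with hypothetical URLs, run from the site root):

    # preview what would change without touching the database
    wp search-replace 'http://old-host.example' 'http://new-host.example' --dry-run
    # then run it for real; serialized values in wp_options get
    # re-serialized with corrected string lengths instead of breaking
    wp search-replace 'http://old-host.example' 'http://new-host.example'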
I had talked to my client before about putting together a decent process including dev databases, scheduled backups, everything needed to prevent just such a screwup, but he waffled. Then blamed me when things went wrong.
I'd had enough and told him to do his own tech work, leaving him to fix his client's website himself. Being that I didn't build it, I didn't know which settings to flip back. I left freelance work and never looked back.
People and companies do this all the time: refuse to spend the time and money ensuring their systems won't break when you need them the most, then scapegoat the poor little technician when they do.
I'd like to say the answer is "don't work in such environments," but there's really no saying that it won't be this way at the next job you work, either.
I certainly wouldn't internalize any guilt being handed down, ultimately it's the founders' jobs to make sure that the proper systems are in place, after all, they have much more on the line than you do. Count it a blessing that you can just walk away and find another job.
I agree with the comments here that spread the blame past this author.
I manage a large number of people at a news .com site and know that screw-ups are always a combination of two factors: people & systems.
People are human and will make mistakes. We as upper management have to understand that and create systems, of various tolerance, that deal with those mistakes.
If you're running a system allowing a low-level kid to erase your data, that was your fault.
I'd never fire someone for making a stupid mistake unless it was a pattern.
Who asks junior engineer to develop directly on live systems with write access and no backup? Are you kidding me?
Edit: No one ever builds a business thinking about this stuff, until something like this happens. There are people who have learned about operations practices the hard way, and those who are about to. They hung the author out to dry for a collective failure and it shows that this shop is going to be taught another expensive lesson.
I'm with everyone else in this thread: you screwed up but in reality it is EXPECTED.
Do you know why I have backups? Because I'm not perfect and I know one day I will screw up and somehow drop the production database. Or mess up a migration. Or someone else will. This is stuff that happens ALL THE TIME.
Your CEO/CTO should have been fired instead. It is up to the leadership to ensure that proper safeguards are in place to avoid these difficult conversations.
Whoever a) gave production db access to a "junior" engineer and b) disabled backups of said database is at fault. I hope the author takes this more as a learning experience of how to (not) run a tech department than any personal fault.
Someone who has to use a GUI to manage a db at a company of that scale shouldn't have access to prod.
Let me make it really simple: anything that happens in a company is always, always management's fault. The investors hire the management team to turn a pile of money into a bigger pile of money, and if management fails, it is management's fault, because it can do whatever it needs to do (within the law) to make that happen. That they failed to hire, train, motivate, fire, promote, follow the law, develop the right products, market them well, ensure reliability, ensure business sustainability, ensure reinvestment in research and development, and ultimately satisfy investors, is their fault, and they further demonstrate their failure by not taking responsibility for their own failure and blaming others.
Ah, I remember being called away from my new year holiday when an engineer dropped our entire database.
This happened because they didn't realise they were connected to the production database (rather than their local dev instance). We were a business intelligence company, so that data was vital. Luckily we had an analysis cluster we could restore from, but afterwards I ensured that backups were happening... never again.
(Why were the backups not already set up? Because they were not trivial due to the size of the cluster and having only been CTO for a few months there was a long list of things that were urgently needed)
This brings to mind one of my common responses: if it's not important enough to back up, it's not important!
It may be expensive, either in complexity, costs of storage/services, etc, but it's a necessity.
I'm curious about many of the comments in this thread - why are people logging in as table owners? It's not too difficult (for talented data-driven companies) to create roles or accounts that, while powerful, still make it difficult to drop a table and such.
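A sketch of what that looks like in MySQL, with made-up account and database names: the application account gets day-to-day DML but not the statements that remove whole tables.

    CREATE USER 'app'@'10.0.%' IDENTIFIED BY 'change-me';
    GRANT SELECT, INSERT, UPDATE, DELETE ON game.* TO 'app'@'10.0.%';
    -- deliberately no DROP or ALTER: this account can still delete rows,
    -- but it cannot drop or truncate tables; schema changes go through a
    -- separate migration account used only at deploy time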
>I was 22 and working at a Social Gaming startup in California.
>Part of my naive testing process involved manually clearing the RAIDS table, to then recreate it programatically.
>Listening to their engineer patiently explain that backups for this MySQL instance had been cancelled over 2 months ago.
"The CEO leaned across the table, got in my face, and said, "this, is a monumental fuck up. You're gonna cost us millions in revenue".
What. The. Fuck.
The LAST person I would blame is the brand new programmer. They don't back up their production database? If it wasn't this particular incident it would have been someone else, or a hardware failure.
I was working two years ago in very successful, billion dollar startup. All developers had production access, but then, if you didn't know what you were doing, you would not be working there. Also, we didn't routinely access production and when we did, mostly for support issues on which we rotated, we did through 'rails console' environment that enforced business rules.
In theory you could delete all data, but only in theory, and even then, we could restore it with minimal downtime.
I think it is obvious that CEO/CTO are the one to be held responsible here.
To add to this: I again work at a billion-dollar company (I think; they are really big). I don't work on their main property, but I have production db access. This is something senior developers definitely should have access to, but with great privilege comes appropriate responsibility.
I routinely run reports, and sometimes I'll wipe out spammers that got past our filters, etc.
Your CEO was correct. He should have also said the same thing to the guy who cancelled backups as well...and the guy who never put in place and periodically tested a disaster recovery plan. So much fail in this story, but mistakes happen and I've had my share as well.
I once (nah twice) left a piece of credit card processing code in "dev mode" and it wasn't caught until a day later costing the company over 60k initially. Though they were able to recover some of the money getting the loss down to 20k. Sheesh.
Sounds to me like this operation was second-rate and not run professionally. If this sort of incident is even able to happen, you're doing it wrong. Maybe it's just my experience with highly bureaucratic oil and gas companies, but the customer database had no backups for 2 months?!?!?!?!?!?!
That is asinine. What would they have done if they couldn't pin it on a junior engineer? A disk failure would have blown them out of the water. I think he did them a favor, and hopefully they learned from that.
Wow. This reminds me of a time when I used to work for a consulting agency. It was back in 2003 and I was working on some database development for one of the company's biggest clients. One day, I noticed the msdb database had a strange icon telling me it was corrupted. I went onto MSDN and followed some instructions to fix it and, BAM, the database I had been working on for months was gone (I was running SQL Server 2000 locally, where this all happened, and I was very junior as a SQL developer). I was silently freaking out, knowing this could cost me my job. I got up from my desk and took a walk. On that walk, I contemplated my resignation. When I got back, it occurred to me that maybe the database file was still there (I had zero clue at the time that msdb's main purpose was just cataloguing the existing databases, among other things). I did a file search in the MSSQL folders and found a file named with my database's name. So, that day I learned how to attach a database, what msdb's role is, and to take precautions before attempting a fix! However, OP's post shows that this company had no processes in place to control levels of access or for disaster recovery. That shows the company's faults more than OP's.
This was clearly a lack of oversight and sound engineering practices.
Who cancelled the backups? Why were they cancelled? Was it for non-payment of that service?
---
I worked for an absolutely terrible company as Director of IT. The CEO and CTO were clueless douchebags when it came to running a sound production operation.
The CTO would make patches to the production system on a REGULAR basis and break everything, with the statement "that's funny... that shouldn't have happened"
I had been pushing for dev|test|prod instances for a long time - and at first they appeared to be on-board.
When I put the budget and plan together, they scoffed at the cost, and reiterated how we needed to maintain better up-time metrics. Yet they refused to remove Dave's access to the production systems.
After a few more outages, and my very loud complaining to them that they were farking up the system by their inability to control access - I saw that they had been hunting for my replacement.
They were trying to blame me for the outages and ignoring their own operational faults.
I found another job and left - they offered me $5,000 to not disparage them after I left. I refused the money and told them to fark off. I was not going to lose professional credibility to their idiocy.
I think that everyone does this at some point in their career. Don't let this single event define you. The most important thing to ask yourself is what was the lesson learned...not only from your standpoint but also from the business'.
In addition, to heal your pain it's best to hear that you're not the only one who has ever done this. Trust me, all engineers I know have a story like this. (Please share yours, HN - here, I even started a thread for it: http://news.ycombinator.com/item?id=5295262)
Here is mine:
When I worked for a financial institution, my manager gave me a production-level username and password to help me get through the mounds of red tape which usually prevented any real work from getting done. We were idealists at the time. Well, I ended up typing that password wrong more than 3 times... shit, I locked the account. Apparently half of production's apps were using this same account to access various parts of the network. Essentially, I brought down half our infrastructure in one afternoon.
Lesson learned: Don't use the same account for half your production apps. Not really my fault :).
If you want to see a monumental screw-up, look at Knight Capital Group (they accumulated a multi-billion-dollar position in the span of minutes, losing upwards of $440M, and ended up having to accept a bailout and sell themselves to GETCO):
Good lord, that's unbelievable! If millions of dollars are riding on a database, they should have spent a few thousand to replicate the database, run daily backups and maintain large enough rollback buffers to reverse an accidental DROP or DELETE.
We've all screwed up at various times (sometimes well beyond junior phase), but not to have backups.... That's the senior management's fault.
Agreed, but if the manager took responsibility for this he or she probably would be fired. Still, it is the only way to be; otherwise, you're not a real leader.
I found myself doing very much this my very first day on the job working for a software startup.
We had a Grails app that acted as a front end for a number of common DB interactions, which were selected via a drop down. One of these (in fact, the default) action was titled "init DB". Of course, this would drop any existing database and initialize a new one.
When running through the operational workflow with our COO on the largest production database we had, I found myself sleepily clicking through the menu options without changing the default value. I vividly remember the out of body experience the OP describes, and in fact offered to fire myself on the spot shortly thereafter.
It's fun to laugh about in hindsight, but utterly terrifying in the moment - to say nothing of the highly destructive impact it had on my self confidence.
How did the company deal with the loss of that database? Did they actually have backups, and just restored the data? Did they reconstruct the data from other sources?
In our case we had periodic backups, and together with filesystem logs were able to restore most of the data. However, we were hosting highly sensitive data and the work being done was time critical. The downtime was therefore not popular with our clients, who were losing ~$15k per hour offline.
This article sounds so incredible to me, I think I might have been holding my breath reading it. These are two major mistakes that the company is responsible for, not the author. Why would they let anyone in on the production password and run direct queries against that database instead of working in a different environment? It's laughable that they sent this to their customers, admitting their amateurism. Secondly, no backups? At my previous project, a similar thing happened to our scrum master: he accidentally dropped the whole production database in much the same kind of situation. The database was back up in less than 10 minutes with an earlier version. It's still a mistake that should not be possible to make, but when it happens you should have a backup.
Oh dear... I once logged into the postgresql database of a very busy hosted service in order to manually reset a user's password. So I started to write the query:
UPDATE principals SET password='
Then I went and did all the stuff required to work out the correctly hashed and salted password format, then finally triumphantly pasted it in, followed by '; and newline.
FORGOT THE WHERE CLAUSE.
(Luckily, we had nightly backups as pg_dump files so I could just find the section full of "INSERT INTO principals..." lines and paste in a rename of the old table, the CREATE TABLE from the dump, and all the INSERT INTOs, and it was back in under a minute - short enough that anybody who got a login failure tried again and then it worked, as we didn't get any phonecalls). It was a most upsetting experience for me, however...
While I fully agree with the position that the author is not entirely responsible, I find it hard to believe it happened the way it appears to have.
It could be, but there are a bunch of loopholes. I can believe that he was careless enough to click "delete" on the users table. I can believe that when the dialog box asked "are you sure you want to drop this table?" he clicked yes. I can believe that after deleting he committed the transaction. But what I can't believe is that the database let him delete a table that every other table depended on via a foreign key constraint. It could be argued that they hadn't put constraints on the table for efficiency reasons, but it's hard to digest.
The story is probably somewhat tailored to fit the post.
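For the curious, a minimal sketch of the point about constraints, assuming PostgreSQL and made-up table names - a users table that other tables reference can't simply be dropped:

    CREATE TABLE users (
        id       serial PRIMARY KEY,
        username text NOT NULL UNIQUE
    );

    -- any table that references users via a foreign key
    CREATE TABLE inventory (
        id      serial PRIMARY KEY,
        user_id integer NOT NULL REFERENCES users(id)
    );

    -- fails: "cannot drop table users because other objects depend on it"
    DROP TABLE users;

    -- only an explicit CASCADE (or having no constraints at all) gets through
    DROP TABLE users CASCADE;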
Using a relational database as a flat data store? Super bad.
Honestly...I think this company deserved what they got. Good thing the author got out of there. Hopefully in their new position they will learn better practices.
Users is a bit of a core table in most applications. If they were using the relational database as it should be used, there would be references to the users table elsewhere in the database.
If you tried to delete the table, it would fail, stating that the deletion would violate the constraints - assuming you didn't have deletions cascade automatically (which would be equally bad).
On the other hand (and it probably happened here) there will be one table with all sorts of data bolted on.
So say you want a user to have multiple pieces of armor (following the spirit of this post). You should have an armor table and a user-to-armor many-to-many table. But instead you just add an Armor column to the user record and create a new user record (with the same username, for example, but a different unique artificial key) with the new piece of armor in the armor column. Then to retrieve it you just select armor where username = whatever and iterate through the list. Adds and deletions are just as easy. So, why not? Well, duplication of data, for one thing. And no referential integrity protection for another. Delete a username and everything is deleted. Forget a WHERE clause and you are sunk.
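A minimal sketch of the junction-table version described above, with made-up names and PostgreSQL syntax:

    CREATE TABLE users (
        id       serial PRIMARY KEY,
        username text NOT NULL UNIQUE
    );

    CREATE TABLE armor (
        id   serial PRIMARY KEY,
        name text NOT NULL
    );

    -- one row per (user, armor) pairing; no duplicated user rows
    CREATE TABLE user_armor (
        user_id  integer NOT NULL REFERENCES users(id),
        armor_id integer NOT NULL REFERENCES armor(id),
        PRIMARY KEY (user_id, armor_id)
    );

    -- retrieve a user's armor without duplicating any user data
    SELECT a.name
    FROM armor a
    JOIN user_armor ua ON ua.armor_id = a.id
    JOIN users u       ON u.id = ua.user_id
    WHERE u.username = 'whatever';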
Ah, I see - I misunderstood the first time around. I thought you meant storing the user table in a flat file. Thank you for the explanation. That reminds me, I need to convert to InnoDB one of these days.
Although the author of the post obviously made a huge mistake, he is far from being the one actually responsible for the problems that followed it. It's the job of the CTO to make sure no one can harm the company's main product this way, accidentally or not.
He should never have been writing code against the production database when developing new features. And if he was, it wasn't his fault, considering he was a junior developer.
And who the hell is stupid enough not to have any recent backup of the database behind a piece of software that brings in millions in revenue?
In the end, when you do such a shitty job of protecting your main product, shit will eventually happen. The author of the post was merely an agent of destiny.
I don't think this is the author's fault. These kinds of human mistakes are more than common. It is sad that top management actually assigned the blame to this young man. This was an engineering failure.
I can understand what this person must have gone through.
I had a similar situation when collaborating with a team on a video project during a high school internship. Somehow I managed to delete the entire timeline accounting for hours of editing work that my boss had put in. To this day I don't know how it happened, I just looked down and all the clips were gone from the timeline. In the end, I think we found some semblance of a backup, and at least we didn't lose the raw data/video content, but I can relate to the out-of-body experience that hits you when you realize you just royally screwed up your team's progress and there's nothing you can do about it.
Every engineer's worst nightmare. I've worked at one of the biggest software companies in the world, and I'm working on my own self-funded one-person startup: the panic before doing anything remotely involving production user data is still always nerve-wracking to me.
But I agree with everyone's assessment here of the failure of the whole company to prevent this. A hardware failure could just as easily have wiped out all their data. If you're going to cut corners with backing up user data, then you should be prepared to suffer the consequences.
Thanks for sharing this. Took real guts to put it out there.
If your senior management/devs are worth anything, they were already aware that this was a possibility. There is no excuse for what ostensibly appears to be a total lack of a fully functioning development & staging environment--not to mention any semblance of a disaster recovery plan.
My feeling is that whatever post-incident anger you got from them was a manifestation of the stress that comes from actively taking money from customers with full knowledge that Armageddon was a few keystrokes away. You were just Shaggy pulling-off their monster mask at the end of that week's episode of Scooby Doo.
Your response should have been: "With all due respect sirs, I agree that I am still lucky to be here, that the company is still here being that it's so poorly managed, that they cancelled their only backups with Rackspace, that they had no contingency plans, and that you were one click from losing millions of dollars - by your own estimate. It makes me wonder what other bills aren't being paid and what other procedures are woefully lacking. I will agree to help you through this mess, and then we should all analyze every point of failure across all departments and go from there."
That wasn't your failure per se.
But the failure of pretty much everyone above you.
That they treated you like that after the fact is pretty shitty.
In hindsight I'd say that you are much better off not being there, where you would only have learned bad practices.
No Stage Environment.
Proactively Cancelled Backups on a Business Critical System.
Arbitrarily implementing features 'because they have it' rather than because they serve some purpose in the business model.
No Test Drills of disaster scenarios.
The list goes on.
As I say, and as you probably realise now, you are lucky to no longer be there.
This is not your fault. Not really. And it's galling that the company blamed the incident on the workings of a 'junior engineer'. There was NO DATABASE BACKUP! For Christ's sake. This is live commercial production data. No disaster recovery plan at all. Zilch. And to make matters worse, you were expected to work with a production database when doing development work. This company has not done nearly enough to mitigate serious risks. I don't blame you for quitting. I would. I hope you have found or manage to find a good replacement role.
Know a lot of others have said it, but no production backups? Blame a junior dev for a mistake that almost 100% of the people I've worked with have made at some point or another (including me)? I feel horrible for the author, it's sickening the way he was treated. Now they'll just move on, hire another junior, never mention this, and guess what? The next guy will do the same thing and there probably still aren't any backups. Didn't learn anything, well, other than how easy it is to blame one person for everyone's failure.
A lot of people have said it before on here, but really?! The company is blaming one person. Yes, it was technically his fault, but why was he allowed on the production database in the first place, and why wasn't the company keeping regular backups of all this mission-critical data?
If the company saw the data in this live database as so critical, you would have thought they would not have given the keys to everyone - and that if they did, they would at least make sure they could recover from this, and fast.
While working for a large computer company in the late 90s, I joined a team that ran the company store on the web. The store used the company's own e-commerce system, which it was also selling.
The very first day, at home in the evening, I went to the production site to see if I could log in as root using the default password. Not a problem.
Anyone with any experience with the product could easily have deleted the entire database. I immediately changed the password and emailed the whole team.
- Even having read access to that table with customer info...
You are hardly responsible. Yeah, you fucked up badly, but everyone makes mistakes. This was a big-impact one and it sucks, but the effect it had was in no way your fault. The worst-case scenario should have been two hours of downtime and one-day-old data being put back in the table, and even that could have been prevented easily with decent management.
I still remember the all-consuming dread I felt as an intern when I ran an UPDATE and forgot the WHERE clause. I consider it part of the rite of passage in the web developer world. Kind of like using an image or text in a dev environment that you never expect a client to see.
Luckily the company I was at (like any rational company) backed up their db and worked in different environments, so it was more of a thing my coworkers teased me for than an apocalyptic event.
I'm a little worried for OP because he obviously took the time to keep the characters in this article anonymous, but we now know who this CEO with the ridiculous behavior must have been, since we know the name of OP's former company from his profile. Not sure what said former CEO of the now-acquired company can do, but this is the kind of thing I fear happening to me when/if I write something negative about a past employer, being a blogger myself.
This happened somewhat in reverse to someone I worked with. He was restoring from a backup. He didn't notice the "drop tables" aspect, assuming, as one might, that a restore would simply supplement the new stuff rather than wipe everything clean and go back in time to a few weeks ago.
He is (still) well-liked, and we all felt sick for him for a few days. Our boss had enough of a soul to admit that we should have had more frequent backups.
Wow, that's terrible. Mistakes happen, and for the notion of 'blame' to surface requires some monumentally incompetent management... the exact kind that would have their junior programmers developing against an un-backed-up production database.
The immediate takeaway from a disaster should always be 'How can we make sure this doesn't happen again?' not 'Man, I can't believe Fred did that, what an idiot.'
LOL, a gaming startup I worked for in 2010 had the same fuckup! But nobody was fired or quit; there was just total anger around the place for a few days, and almost all the data was eventually recovered. The startup still flopped about a year later, with ever-falling user retention rates - the marketplace was more and more flooded with more and more similar games.
The CEO leaned across the table, got in my face, and said, "this is a MONUMENTAL fuck up."
It certainly was -- on multiple levels, but ultimately up at the C-level. Blaming a single person (let alone a junior engineer) for it just perpetuates the inbred culture of clusterfuckitude and cover-ass-ery which no doubt was the true root cause of the fuck-up in the first place.
I think all developers have to do something like this at some point to develop the compulsion I have, which is backups to the extreme. I can never have enough backups. Before any DROP/ALTER-type change I make a backup. (I also learned to pretty much never work on a production db directly, and, in the event I need to, to do a complete test of the script in staging first...)
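A minimal sketch of that habit, assuming PostgreSQL and an illustrative table, column, and backup name:

    -- quick snapshot before the destructive change (backup name is illustrative)
    CREATE TABLE users_backup_pre_migration AS TABLE users;

    -- run the risky DDL inside a transaction; PostgreSQL can roll back DDL
    BEGIN;
    ALTER TABLE users DROP COLUMN legacy_flag;
    SELECT count(*) FROM users;   -- sanity check before committing
    COMMIT;                       -- or ROLLBACK; if anything looks off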
He worked at a company stupid enough to test on the prod databases without tools to safely clear them. The former is stupid, the latter is REALLY stupid.
This is a multi-layer failure and almost none of the blame falls on him. Stupid compounded stupid, and this guy did nothing more than trip over the server cord that several people who knew better had stupidly run right past his cube exit.
Stuff like this happens. The best thing to prevent something like this is to completely sever the line between production and development. I've worked with companies that work directly on the production database. It's horrible. How can the person in charge of managing the workflow expect something like this not to happen eventually?
Things like this fall on the shoulders of the team as a whole. It's certainly a tough pill to swallow for a junior engineer, but a good senior developer or PM should also have realized you were working on prod and tried to remedy that situation. Humans are notoriously prone to fat-fingering stuff. Minimize risk wherever you can!
I think it is entirely clear from the writing that the author is a humble being. I feel sorry for him; from the writing it seems he is a much better person and engineer than most of the others at that company who are pointing fingers at him.
The guy may be absentminded, but that is a trait of some of the brightest people on earth.
This company sucks. You are out of college and doing your first job. Are they stupid enough to give you direct access to the production database? If they are making millions in revenue, where were their DBAs? Obviously the management got what they deserved. It's unfortunate that it happened through you.
They should reward him. Seriously, anyone who exposes such a huge weakness deserves a reward. He limited the damage to only 10k users' data loss. With such abysmally crappy practices the damage would have happened anyway, only perhaps with 30k users and who knows what else instead of a mere 10k.
This was not the junior engineer's fault, but the DBA's fault. Any company should be backing up their database regularly, and then testing the restores regularly. Also, don't give people access to drop tables, etc. This was a very poor setup on the part of the company/DBA, not the engineer.
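A minimal sketch of that kind of privilege separation, assuming PostgreSQL and made-up role names:

    -- application role: day-to-day DML only; it owns no tables, so it can't DROP them
    CREATE ROLE app_rw LOGIN PASSWORD 'change-me';
    GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA public TO app_rw;

    -- read-only role for anyone who genuinely needs to look at production
    CREATE ROLE dev_ro LOGIN PASSWORD 'change-me';
    GRANT SELECT ON ALL TABLES IN SCHEMA public TO dev_ro;

    -- DROP TABLE requires table ownership (or superuser), which neither role has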
Bold on your part to own up and offer a resignation. (The "higher-ups" should have recognized that and not accepted it.) From the movie "The Social Network": http://www.youtube.com/watch?v=6bahX2rrT1I
Wow, that's quite a story.
If your company is ever one button press away from destruction, know that this will eventually happen.
I'm quite surprised stuff like this hadn't happened earlier. When I am doing development against a database, I will quite often trash the data while writing code.
This has been stated by others, but it's not the author's fault. It's totally idiotic for a database like that not to have been regularly backed up. At worst, this should have been only a couple hours of down time while the database was restored.
It's an organisational failure if a junior employee can bring down the company in a few clicks. No backups, testing on the production database - this is no way to run a company. I feel sorry for the guy, who made a simple mistake.
So many structures in life are based around 'not fucking up.'
We protect our assets & our dignity as if they mean anything; and yet at the end of the day nobody knows what the fuck is going on.
This is insanity! It was already pointed out in the comments, but I still can't believe a company that mature (actually having thousands of users and millions in revenue!!!) would omit such a basic security precaution.
Giving [junior?!] developers free rein in the production database, and no backups????
Seriously, the CTO should have been fired on the spot instead of putting the blame on the developer.
No matter how careful you are (I'm extremely careful) when working with data, if you're working across dev/qa/uat/prd, sooner or later someone on the dev team will execute against the wrong environment.
It's not your fault.
It's the fault of the person who cancelled the backups, the person who didn't check that backups were being created, the "senior" people who let you work on the production database... etc.
It was a mistake, but not a huge one. You should never have been without backups in the first place - and that wasn't your responsibility. Plus, there should have been dev instances and a proper coding environment.
This is the most ridiculous thing ever. Why weren't there backups? Sure, the author was the one who "pulled the trigger" but the management "loaded the gun" by not making sure there were back-ups.
I think it's admirable that you stayed long enough to help fix everything before quitting, despite it being rough -- even though, as others have said, others screwed up even bigger than you did.
In my opinion, it's the fault of the whole organization, or at least the engineering team, for making it so likely that something like this would happen.
My name is Myles. I read this and felt like I was looking into a crystal ball. Fortunately, my work doesn't require I interact with the production database (yet). Gulp.
I find it disgusting that the "game designers" are the so-called overlords. Fuck them. If you're a developer and a gamer then you're practically a game designer. Whatever "education" they had is bullshit. You can go from imagination to reality with just you alone - and perhaps an artist to do the drawing. All those "idea" fuckers, a.k.a. game designers, are just bullshitters.
And yeah this wasn't your fault. It was the CTO's fault. He holds responsibility.
"They didn't ask those questions, they couldn't take responsibility, they blamed the junior developer. I think I know who the real fuckups are."
This could've happened to anyone. It's a huge shame for those in charge, not for you. Any business letting such operations happen without backups or proper user-rights management should consider why they still exist, if they really make huge amounts of money as you mentioned.