How I Fired Myself

bguthrie · on Feb 27, 2013

More than anything else, this describes an appalling failure at every level of the company's technical infrastructure to ensure even a basic degree of engineering rigor and fault tolerance. It's noble of the author to quit, but it's not his fault. I cannot believe they would have the gall to point the blame at a junior developer. You should expect humans to fail: humans are fallible. That's why you automate.

potatolicious · on Feb 27, 2013

More than that, it's telling that the company threw him under the bus when it happened. I've been through major fuckups before, and in all cases the team presents a united front - the company fucked up, not an individual.

Which is, if you think about it, true, given that the series of events leading up to the disaster (the lack of a testing environment, working with prod databases, lack of safeties in the tools used to connect to database, etc...).

The correct way to respond to disasters like this is "we fucked up", not "someone fucked up".

damm · on Feb 27, 2013

When things like this happen, you have to realize there is more than one 'truth'.

There is the truth that here is someone who truncated the users table and because of that it caused the company great harm.

Here's another 'truth'.

1. The company lacked backups 2. The junior developer was on a production database.

Note: I'm from the oldschool of sysops who feel that you don't give every employee the keys to your kingdom. Not every employee needs to access every server, nor do they need the passwords for such.

3. Was there a process change? I doubt it, likely they made the employee feel like a failure every day and remind him of how he can be fired at any moment. So he did the only thing he could do to escape: Quit!

Horrible and wrong, if there was a good ethics lawyer around he would say it smells ripe of a lawsuit.

... That said, that lawyer and lawsuit won't fix anything.

bguthrie · on Feb 27, 2013

The problem isn't who has the keys, it's how they're used. I don't care as much if a junior developer has the prod password; I care more about building an engineering and ops team that understands that dicking around with the prod database isn't okay. Sysops and DBAs are fallible too--I've seen a lot of old school shops that relied heavily on manual migration and configuration. Automate, test, isolate and expect failure!

geoka9 · on Feb 28, 2013

And, most importantly, shake off the "agile" ethos when doing DB migrations. Just forget they exist and triple-check every character you type into the console.

geirhe · on March 2, 2013

What, exactly, do you think the "agile" ethos is? We are agile, but that does _not_ mean that we don´t triple-check everything we type into production consoles. It does, however, mean that everything we do in production has been done before in at least one test environment. Why would we want to forget about doing that?

JPKab · on Feb 28, 2013

The fact that the cheap ass idiots at the company had cancelled their backup protection at Rackspace, and lacked any other form of backup is just complete incompetence.

If it hadn't been a junior developer who nobody noticed or cared was using the prod DB for dev work, it would have been an outright failure. DBs fail, and if you don't have backups you are not doing your damn job.

The CEO should be ashamed of himself, but the lead engineers and the people at the company who were nasty to this guy should all be even more ashamed of themselves.

gnur · on Feb 27, 2013

That's why you don't give out the keys to people that know how to act responsible.

damm · on Feb 27, 2013

Don't you mean that's why you don't give keys to people that do not know how to act responsibly?

Giving keys to irresponsible people seems irresponsible ;)

krob · on Feb 28, 2013

That's why you avoid IT like the plague and use Heroku, that why when someone Fubar's you can blame it on amazon web services :) j/k

j_baker · on Feb 27, 2013

I've been through major fuckups before, and in all cases the team presents a united front - the company fucked up, not an individual

You should consider yourself very lucky. Or very savvy at knowing which companies to avoid.

silverbax88 · on Feb 28, 2013

I have to chime in and completely agree. Very lucky. Most people who survive for years at companies have learned to either stay out of sight, or navigate the Treacherous Waters of Blame whenever things go wrong.

This is actually one of the things most employees who have never been managers don't understand.

blablabla123 · on Feb 28, 2013

Your comment makes me think. Are you implying that this is a good practice?

I mean, in fact I do something similar. At our company also a lot of stuff goes wrong. Somehow it surprises me that there was no major fuckup yet. But I do realize that I need to watch out all times that blame never concentrates on me.

It is so easy to blame individuals, it just suffices to have participated somehow in a task that fucked up. Given that all other participants keep a low profile, one needs to learn how to defend/attack in times of blame.

silverbax88 · on Feb 28, 2013

I absolutely do NOT think it is a good practice. I think it is what lazy companies full of people afraid to lose their jobs do. I think it's most companies.

The reality is that fear is a greater motivator than any other emotion - over anger, sorrow, happiness. So companies create cultures of fear which results in productivity (at least a baseline, 'do what I need to do or not get fired' productivity), but little innovation and often at the expense of growth.

Plus, it's just hell. You want to do great things, but know you are stepping into the abyss every time you try.

ahoyhere · on Feb 28, 2013

You (and the other commenters with similar strategies) are wasting productive years of your life at jobs like these. You should go on a serious job hunt for a new position, and leave these toxic wastelands before they permanently affect your ability to work in a good environment.

j_baker · on Feb 28, 2013

As soon as you find an environment where no one ever plays the blame game, let me know.

potatolicious · on Feb 28, 2013

I wish I could say I have some sort of prescience about bad working environments. In reality I'm just good at getting out while the gettin' is still good.

sdoering · on Feb 28, 2013

Oh yes, I second that. Feel really, really lucky...

oseibonsu · on Feb 27, 2013

I believe the company may be http://www.klicknation.com/games/ Can anyone confirm?

amethyst · on Feb 27, 2013

From the author's page, clicking "PAST" shows the following:

I've worked as a Software Engineer at TechStarsNYC, Klicknation(acquired), Shelby and I'm currently working on Followgen

So, yes.

podperson · on Feb 28, 2013

Apparently their logo is ";-)". At first I thought they just had this annoying tendency to overuse the wink emoticon and found it a bit creepy. Then I saw it all over their menus and found it a bit creepy. Then I realized they appeared to be using it as their logo and I found it creepy.

seanfallon · on March 4, 2013

Well, I will never work there. Ever.

trotsky · on Feb 28, 2013

When you think about it it's almost a logical certainty that he'd take the fall. Any company collectively able to understand that the actual failure was inadequate safeguards would have been able to see it coming and presumably would have prevented it from happening. If you're so inexperienced that you expect no one will ever make a mistake you'll obviously assume that the only problem was that someone made one.

It's somewhat fascinating to hear that a company that damaged managed to build a product that actually had users. And I thought my impression of the social gaming segment couldn't go any lower.

ptaipale · on Feb 28, 2013

Indeed. The news here is not that a junior developer made an error with SQL db (everyone makes them, eventually), nor that the company did not have proper safeguards against problems (this happens far too often). The news is that despite such basic management incompetence, the company had been able to get a large number of paying customers.

memracom · on March 1, 2013

Just think, the company spent millions of dollars training this guy in protecting production data, and then instead of treating him like gold, and putting him in charge of fixing all the weaknesses that allowed this to happen, they pushed him out of the company. Stupidity beyond measure.

Not surprised that this was a gaming company because lack of teamwork seems to be endemic in that industry.

franze · on Feb 28, 2013

    The correct way to respond to disasters like this is "we fucked up", not "someone fucked up".

can't agree more! a company that starts playing the blame game - and even communicates this to the outside a) looks unprofessional b) is unprofessional c) poisons the corporate culture

its a fail on every possible level, the technical part beeing only minor.

coverband · on Feb 27, 2013

Exactly this.

sc0rb · on Feb 28, 2013

These kind of comments need to be down voted for adding no value.

tomp · on Feb 28, 2013

Now there are already three comments adding no value.

marcosploither · on Feb 28, 2013

joedev · on Feb 28, 2013

make that 5

downandout · on Feb 27, 2013

Agreed. How could a company with "millions in revenue" not backup critical databases? Not only were they exposed to the threat of human error, but hardware failures, hackers, etc. When he submitted his resignation, the company should have encouraged him to stay. Instead, anyone at the company that had anything to do with the failure to implement regular database backups and the use of redundant databases should have been fired.

I've had more than my fair share of failures in the start up world. It always drives me crazy to see internet companies that have absolutely no technical ability succeed, while competing services that are far superior technically never get any traction.

rodgerd · on Feb 27, 2013

It's the difference between understanding the incident (programmer makes mistake) and the root cause (failing at data integrity 101). Not just no backups, but no audit log of what's happened in the game - I'd expect an MMO to have an append-only event history quite apart from their state information.

_h8ft · on Feb 28, 2013

At least all transactions for purchased game items should have been logged in a separate database. There's nothing worse for a digital content company than forgetting who bought what.

happycube · on Feb 27, 2013

And worse, they had a backup service, and then dropped it to save money.

reinhardt · on Feb 27, 2013

This sounds so unreal that I am having doubts about the veracity of the story and would defer any judgement before hearing from the "other side".

angersock · on Feb 27, 2013

Having just helped a friend who was miffed at the idea of spending money on a new UPS for >$5k worth of networking equipment: do not be surprised.

Penny-wise pound-foolish.

dntrkv · on Feb 27, 2013

Spending $5k on a UPS is very different from not having backups of your production database which runs your multimillion dollar business. This story just doesn't add up.

joemoon · on Feb 28, 2013

You misunderstood the comment. Angersock is saying that the networking equipment cost more than $5k and the friend was unwilling to buy a UPS to protect the equipment.

georgemcbay · on Feb 27, 2013

Unfortunately it sounds perfectly plausible to me.

"Why are we paying for backups? The database has never failed (yet)"!

whyaduck · on March 1, 2013

Super important goal my boss gave me: reduce expenditures...CHECK! Can't wait for that bonus check!!

arethuza · on Feb 28, 2013

Sounds completely possible to me - I have worked in a lot of environments and at some places I've seen decisions that make cancelling backups look like an act of genius.

One interesting observation I can make: no correlation between excellence in operations and commercial success!

xradionut · on Feb 28, 2013

You would be amazed if I told you how many "vital" services are like the blog writers former employers: One button push from destruction.

artursapek · on Feb 28, 2013

Oh? Tell me.

MBCook · on Feb 28, 2013

I've seen something like this. It could be a situation where they were transitioning from that backup service to their own server or another service, and the move was never completed due to some hiccup or priority change.

But if no one is watching to ensure the move was finished (or they got distracted), then something many people treat as set-it-and-forget-it could easily get into that state.

eru · on March 6, 2013

If you are transitioning from one backup service to another, shouldn't you only cancel your old service, after you successful set up and tested the new one?

thebigkick · on Feb 28, 2013

I don't think it is that far-fetched. I've had experience with a well known company that does millions in revenue per week (web based shopping cart) that just FTP's everything with no DVCS. Designers, developers, managers all have access to the server and db.

pavel_lishin · on Feb 27, 2013

> It's noble of the author to quit

It's not just noble. Considering how everyone treated him, and the company's attitude in general, he had no future there, and neither should anyone else.

indrax · on Feb 27, 2013

Amazon is at the other end of the spectrum: randomly breaking things so that everything has to be fault tolerant.

http://www.codinghorror.com/blog/2011/04/working-with-the-ch...

This only happened because nobody even asked "What happens if I press this button?"

ajross · on Feb 27, 2013

This really needs to be more of a standard thing. I've been near (but as an engineer, never responsible for) production systems my whole career. None of these systems were as terribly maintained as the one in the linked article. Production data was isolated. Backups were done regularly. Systems were provisioned with fault tolerance in mind.

Not once have I seen a full backup restore tested. Not once have I seen a network failure simulated (though I've seen several system failures due to "kicking out a cable" that sort of acts as a proxy for that technique). On multiple occasions I've seen systems taken down by single points of failure[1] that weren't forseen, but probably could have been.

[1] My favorite: the whole closet went down once because everything was plugged into a single, very expensive, giant UPS that went poof. $40/system for consumer batteries from Office Depot would have been a much better bet. And the best part? Once the customer service engineer replaced whatever doodad failed and brought the thing back up? They plugged everything right back into it.

utunga · on Feb 27, 2013

I'll never forget when my boss was showing the girl scouts (literally) our very expensive UPS room. He explained how even if the power goes off we'll switch to batteries then switch over to generator power. See watch, he says - then flicked the switch. Fooomm... our entire office goes dark.

This took down news information for a good chunk of Wellington finance for about half a day. (Fortunately Wellington, NZ is a tiny corner of the finance world).

Hilarious! But I admit I was super glad it was the boss playing chaos monkey, not me.

brittohalloran · on Feb 28, 2013

Literally lol'd

jms · on Feb 28, 2013

Back when I worked for a small ISP, we had a diesel generator in case the power went out longer than our UPS batteries would last. This provided a great sense of security until we decided to test the system by powering off the main break and... it didn't start!

It turns out the emergency stop button was pushed in. Easy enough for us to fix then, but if the power had gone out at 4am it would have been quite another matter.

After that incident, we turned off the main breaker to the building weekly. It was great fun, as most of our offices were in the same building. We had complaints for the first couple of months until everyone got used to it and had installed mini UPS's for their office equipment.

We did actually have to use the generator for real a while later. Someone had driven their car into the local power substation, and it was at least a month until it was fixed. Electricity was restored through re-routing fairly quickly, but until the substation was repaired we were getting a reduced voltage that caused the UPSs to slowly drain...

miahi · on Feb 28, 2013

The last time they tested the diesel generator failover at a customer's site, the generator went on just fine, but then it did not want to switch to mains again. The whole building was powered by the generator for almost two days, until they managed to convince the generator to switch.

alexkus · on Feb 27, 2013

> Not once have I seen a network failure simulated.

Reminds me of the webserver UPS setup at a previous company.

The router (for the incoming T1) and the webserver were plugged in to the UPS.

UPS connected (via serial port) to webserver. Stuff running on webserver to poll whether UPS running from mains power or batteries and send panic emails if on batteries (for more than 60 seconds) and eventually shutdown the webserver cleanly if UPS power dropped below 25%.

Thing not plugged in to UPS: DMZ Network switch (that provided the connectivity between webserver and router).

groby_b · on Feb 27, 2013

Doing that kind of testing is hard. It costs time and effort. If you want to see it done on a truly awe-inspiring scale (whole data centers being taken down by zombies ;) : http://queue.acm.org/detail.cfm?id=2371516

ajross · on Feb 27, 2013

Doing this kind of testing in a gold-plated, heavily-engineered way is hard. But that's not an excuse for not doing it at all. Just walking into your closet and pulling a cable gets you 80-95% of the testing you need, and is free. Setting up a sandbox and "restoring" a backup onto it and then doing some quick queries is likewise easy to do and eliminates huge chunks of the failure space of "bad backups".

Really, this attitude (that things have to be done right) is part of the problem here. To a seasoned IT wonk, the only alternative to doing something "The Right Way" is not doing it at all. And that's a killer in situations like these.

Don't hack your systems to make them work. Absolutely do hack at them to test.

PeterisP · on Feb 28, 2013

"walking into your closet and pulling a cable" is not free, if your planned disaster recovery is not a seamless failover, but a process to recover data with some work and limited (nonzero) downtime/cost to business.

For example, our recovery plan for a financial mainframe in case of most major disasters was to restore the daily backup to off-site hardware identical to production hw; however, the (expensive) hardware wasn't "empty" but used as an acceptance test environment.

Doing a full test of the restore would be possible, but it would be a very costly disruption; taking multiple days of work for the actal environment restoration, deployment,testing and then all of this once more to build a proper acceptance-test-environment. Also destroying a few man-months worth of long tests-in-progress and preventing any change deployments while this is happening.

All of this would be reasonable in any real disaster, but such costs and disruptions aren't acceptable for routine testing.

"Chaos Monkey" works only if your infrastucture is built on cheap unstable and massively redundant items. You can also get excellent uptime with expensive, stable, massively controled environment with limited redundancy (100% guaranteed recovery, but not "hot failover") - but you can't afford chaos there.

ajross · on Feb 28, 2013

To paraphrase: if you go with a awful hack job for your disaster recovery plan, testing is more expensive. And to extend: you won't actually test because it's "too expensive", and your disaster recovery plan won't work.

How is this distinct from "Don't hack your systems to make them work. Absolutely do hack at them to test."? I don't see it.

This just sounds like "my business doesn't have the financial capacity to engineer data recovery processes". Well, OK then. Just don't claim to be doing it.

PeterisP · on Feb 28, 2013

We did know that we can recover backups because we did it for small parts of data, and we know that we can do disaster recovery because (a) we did test this, though very rarely; and (b) we had successfully recovered from actual full-scale disasters twice over ~7 years.

But successful, efficient disaster recovery plan doesn't always mean "no damage" - it often means damage mitigation; i.e., we can fix this with available resources while meeting our legal obligations so that our customers don't suffer; not that there aren't consequences at all - valid data recovery plans ensure that data recovery really is possible and details how it happens, but that recovery can be expensive. And while you can plan, document, train and test activities like "those 100 people will do X; and those 10 sales reps will call the involved customers and give them $X credit", you really don't want to put the plan into action without a damn good reason.

For example, a recovery plan for a bunch of disasters that are likely to cut all data lines from a remote branch to HQ involves documenting, printing & verifying a large pile of deal documents of the day, having them shipped physically and handled by a designated unit in the HQ. The process has been tested both as a practice and in real historical events.

However, if you "pull a wire in the closet" and cause this to happen just so, then you've just 'gifted' a lot of people a full night of emergency overtime work, and deserve a kick in the face.

ajross · on Feb 28, 2013

OK, tough love time:

All I can say is that you're very lucky to have a working system (and probably a company to work for), and I'm very lucky not to work where you do. Seriously, your "test" of a full disaster recovery was an actual disaster! More than one!

And frankly, if your response to the idea of implementing dynamic failure testing is that someone doing that should be "kicked in the face" (seriously, wtf? even the image is just evil), then shame on you. That's just way beyond "mistaken engineering practice" and well on the way to "Kafkaesque caricature of a bad IT department". Yikes.

Admittedly: you have existing constraints that make moving in the right direction expensive and painful. But rather than admit that you have a fragile system that you can't afford to engineer properly you flame on the internet against people who, quite frankly, do know how to do this properly. Stop.

PeterisP · on March 1, 2013

I'd like not to stop, but continue exploring the viewpoints. And I'd like you and others to try and consider also less-tech solutions to tech problems if they meet the needs instead of automatically assuming that we made stupid decisions.

For example, any reasonable factory also has a disaster recovery process to handle equipment damage/downtime - some redundant gear, backup power, inventory of spare parts, guaranteed SLA's for shipping replacement, etc; But still, someone intentionally throwing a wrench in the machine isn't "dynamic failure testing" but sabotage that will result in anger from coworkers who'll have to fix this. Should their system be called "improperly engineered"?

We had great engineers implementing failover for a few 'hot' systems, but after much analysis we knowingly chose not to do it 'your way' for most of them since it wasn't actually the best choice.

I agree, in 99% of companies talked about in HN your way is undoubtedly better, and in tech startups it should be the default option. But there, much of the business process was people & phone & signed legalese, unlike any "software-first" businesses; and the tech part usually didn't do anything better than the employees could do themselves, but it simply was faster/cheaper/automated. So we chose functional manual recoveries instead of technical duplications. And you have to anyway - if your HQ burns down, who cares if your IT systems still work if your employees don't have planned backup office space to do their tasks? IT stuff was only about half of the whole disaster recovery problems.

In effect, all the time we had an available "redundant failover system" that was manual instead of digital. It wasn't fragile (it didn't break, ever - as I said, we tried), fully functional (customers wouldn't notice) but very expensive to run - every hour of running the 'redundant system' meant hundreds of man-hours of overtime-pay and hundreds of unhappy employees.

So, in such cases, you do scheduled disaster-testing and budget the costs of these disruptions as neccessary tests - but if someone intentionally hurts his coworkers by creating random unauthorised disruptions, then it's not welcome.

The big disadvantage for this actually is not the data recovery or systems engineering, but the fact that it hurts the development culture. I left there because in such place you can't "move fast and break things", but everyone tends to ensure that every deployment really, really doesn't break anything. So we got there very good system stability, but all the testing / QA usually required at least 1-2 months for any finished feature to go live - which fits their business goals (stability & cost efficiency rather than shiny features) but demotivates developers.

alxndr · on Feb 28, 2013

What about Chaos Monkey?

hga · on Feb 27, 2013

Bingo. As I like to emphasize, people don't care about backups---this company certainly didn't---they care about restores.

And almost no one is willing to put up the money to do compete testing of restore paths, let along statistically making sure they continue to work.

grey-area · on Feb 27, 2013

My favourite way to test restores is to do them frequently to the dev server from the production backups - this keeps the dev data set up to date, and works as a handy test of the restore mechanism. Of course if you have huge amounts of data or files on production this becomes more difficult, but not impossible, to manage.

smoyer · on Feb 27, 2013

This works well, though you may need an "anonymizer" (and maybe some extra compliance testing) if your systems have PCI or HIPPA data on them. We have federal restrictions against storing certain types of data on servers outside the US. Cloud computing sounds great but neither Amazon or Google will guarantee the data stays within the country's borders.

lukeschlather · on Feb 28, 2013

Have you looked at https://aws.amazon.com/govcloud-us/ ?

Plutor · on Feb 27, 2013

Minor correction: the Chaos Monkey was Netflix's innovation. It just happened to be implemented on Amazon's cloud. It would have been just as useful if they had their own colocated servers or used a different cloud computing provider.

r00fus · on Feb 27, 2013

Apple did this before Amazon or Netflix in this regard [1], but the point needs to be made that a system needs to be tested and not just in a controlled aseptic way, because the real world isn't.

Another story supporting Chaos Monkey is what the Obama team did for their Narwhal infrastructure - they staged outages and random failures to prepare for their big day, meanwhile Romney's team who outspent the Obama team at least an order of magnitude, had their system fail on e-day.

[1] http://folklore.org/StoryView.py?story=Monkey_Lives.txt

pc86 · on Feb 27, 2013

I'd like to see a source for Romney outspending the Obama team "at least" 10x, because while I can speak from experience that ORCA was a gigantic piece of shit, it's not like the Obama people were struggling to pay their bills.

danso · on Feb 27, 2013

I don't know what metric the parent comment is referring to, but in terms of technology stack, I can fully believe that the Romney team spent more than Obama's team. Here's a post by one of the creators of the fundraising platform:

http://kylerush.net/blog/meet-the-obama-campaigns-250-millio...

The short of it is: they used static HTML generated by Jekyll and stored on S3.

pc86 · on Feb 27, 2013

I actually had that post in my mind when writing my reply, but I assumed r00fus was referring to ORCA and Narwhal specifically.

> ... what the Obama team did for their Narwhal infrastructure - they staged outages and random failures to prepare for their big day, meanwhile Romney's team who outspent the Obama team at least an order of magnitude, had their system fail on e-day.

liquidise · on Feb 27, 2013

Exactly what you said. Rigor is truly the right word to use here. Cancelling your db backups is basically asking for a disaster. I'm not sure i have ever been at job that didnt require a db backup for some reason at some time.

mooreds · on Feb 27, 2013

To say nothing of running development code against a production database.

If there were only two junior folks, what were the senior folks doing?

at-fates-hands · on Feb 27, 2013

This was my question. Usually the "junior" folks are shown how to do things by the senior engineers. The fact they threw this guy under the bus while letting the rest of the senior guys skate is appalling.

Part of your job as a senior developer is to ensure this very scenario doesn't happen, let alone to someone on your watch.

asveikau · on Feb 27, 2013

The junior dev was 22 at the time and fresh out of school.

I'm assuming the senior devs were 23 and had a year on the job. The principal devs were a ripe 24, otherwise known as old men. :-)

heathlilley · on Feb 27, 2013

He and his coworkers might not have been "Senior" before and they still might not be fully senior, but they are MUCH closer now.

>Ding! >Gratz!

liquidise · on Feb 27, 2013

The author makes mention of using a UI to connect to their db. If i was in a position over there i can see myself writing a script to clear out the tables i wanted. This reduces errors, but not the risk.

Cthulhu_ · on Feb 27, 2013

And yet, the net result is the same; right-click, clear table, or ./clearTable.sh. Both are human actions, and both are fallible. What if some prankster edited clearTable.sh to do the users table instead of the raids table? What if he did it himself to test something?

chris_wot · on Feb 27, 2013

Forget a prankster! What happens if the full RAID array fails that holds the database? No backups: dead company.

amputect · on Feb 27, 2013

Heck when you put it that way, this guy actually did them a FAVOR. He ONLY wiped out the User table. The company was able to learn the value of backups, and they had enough data left to be able to partially recover it from the remaining tables, which is much better than the worst case scenario.

dsuth · on Feb 28, 2013

Do you think the company learned the value of backups, or do you think they learned to blame junior devs for fuckups? Sounds like they learned nothing, because no attempt was made to determine the root cause.

Aardwolf · on Feb 28, 2013

Arrow up, enter Or ctrl+r <something> enter Can be very dangerous if something other than your routine is in the bash history.

podperson · on Feb 28, 2013

Exactly. It's not like nothing bad ever happened from the command line ;-)

rm * .doc

x3ord · on Feb 27, 2013

Exactly. I'm ashamed that my first reaction reading this was to blame OP. But in the 2 min it took to read the post I had come full circle to wondering what kind of terribly run company would allow this to happen--I guess the type that hires philosophy majors straight out of college without vetting their engineering skills.

pnathan · on Feb 27, 2013

> the type that hires philosophy majors straight out of college without vetting their engineering skills.

Or hires them and then adequately educating them in software development.

quanticle · on Feb 28, 2013

Or one that doesn't have a process to ensure that backup and recovery procedures when something like this happens are as painless as possible.

j_baker · on Feb 27, 2013

"Nobody cares about technical infrastructure. Our customers don't pay us for engineering rigor. We need to just ship!"

Of course the person saying that is likely to care about technical infrastructure when it costs them money and/or customers due to being hacked-together.

chris_mahan · on Feb 27, 2013

Nobody cares about the diameter of the cylinders in the engine, all they care about is going from point A to point B safely, affordably, in style.

gtirloni · on Feb 28, 2013

When I was a junior analyst, I once deleted the main table that contained all 70k+ users, passwords, etc. The problem was fixed in 15min after the DBA was engaged to copy all data back from the QA environment that was synchronized every X minutes. Or we could have restored a back from a few hours ago.

The whole company fucked this one up pretty badly. NO excuses.

jcromartie · on Feb 27, 2013

Indeed. If it had been a sporadic hardware failure, they would have been exactly as screwed. The fact that they gave an overworked junior dev direct read/write access to the production database is astounding.

gte910h · on Feb 27, 2013

It's not that they gave him r/w at all that's so criminally stupid. It's that the required him to clear a table manually, using generic full access tools, over and over.

In reality, this should have been re-factored to the dev db.

If it couldn't be, the junior dev should have been given access to the raids table alone for writes.

Lastly the developer who didn't back up this table is the MOST to blame. Money was paid for the state in that table. That means people TRUST you to keep it safe.

I count tons of people to blame here. I don't really see the junior dev as one of them.

sskates · on Feb 28, 2013

Yeah I think him manually clearing a table over and over again was the big problem here. The amount of entropy that had to be introduced into the process to turn a routine task into millions of dollars of loss was tiny. He just needed to click in a slightly different spot on the webpage to bring up USERS instead of RAIDS.

SteveGerencser · on Feb 28, 2013

I worked for a startup a few years back where the CEO deleted the entire 1+TB database/website when the server was hacked and being used as a spam server because he 1. didn't know who to disable the site short of deleting and 2. could reach anyone that did know how.

The next morning he told us to restore the site from backups and fix the security hole. That's when we reminded him, again, that he had refused to pay for backup services for a site of that size.

We all ended up looking fro new jobs withing a couple days.

readme · on Feb 28, 2013

I wish I could upvote this more. Production database? Seriously, they were one copy away from avoiding this whole outcome.

CEO sounds incompetent as hell.

mehwoot · on Feb 28, 2013

Hindsight is 20/20.

Aardwolf · on Feb 28, 2013

If you have only one copy of data, especially if it is important, the chance of something happening to that copy, either hardware, software, or human error, is always big enough to justify a backup. No hindsight needed for that.

readme · on Feb 28, 2013

Thinking about this just makes my mind go numb. All someone had to do was have the idea that they should back their database up. It would be done in 5 minutes, tops!

I automated ours on S3 with email notifications in under an hour...

fitandfunction · on Feb 27, 2013

Exactly. Problems with production databases are inevitable. It's just a matter of time.

The guy who should be falling on the sword, if anyone, is the person in charge of backups.

Better yet, the CEO or CTO should have made this a learning opportunity and taken the blame for the oversight + praised the team for banding together and coming up with a solution + a private chat with the OP.

benjaminwootton · on Feb 27, 2013

I came here to type that almost exactly word for word.

This is a truly amazing story if this system was really supporting milions in revenue.

jliptzin · on Feb 28, 2013

Unbelievable. Even at my 3 person startup, back in 2010, with thousands in revenue, not millions, we had development environments with test databases and automated daily database snapshotting. Sure I've accidentally truncated a few tables in my time, but luckily I wasn't dumb enough to be developing on a production server.

mncolinlee · on Feb 27, 2013

I cannot even begin to fathom how they functioned without a working development environment for testing, let alone let their backups lapse.

The kind of table manipulations he mentions would be unspeakable in most companies. Someone changing the wrong table would be inevitable. If I were an auditor, I would rake them all over the coals.

sybhn · on Feb 27, 2013

Absolutely. Your immediate technical management sucked, and you were made the scape goat for your management's failure. Welcome to the real world. Don't get me wrong, you should feel bad, very bad, bad enough so that never again you do that. But you shouldn't feel guilty nor rethink your career.

hga · on Feb 28, 2013

A little bit of feeling guilty is in order; the author "didn't know that he didn't know", and I'm sure this motivated him to learn a lot more about proper engineering processes ... something I'll note aren't particularly a focus in CS degrees. Especially since I haven't come across anyone who's really dedicated to them who hadn't first gotten burned in one way or another.

There's a big difference between being told "Do X, don't do Y" and that sinking feeling you get when you realize a big problem exists, regardless of the eventual outcome.

richo · on Feb 28, 2013

You're forgetting the guy who didn't speak up to say "Hey, maybe we shouldn't do this in prod?".

I feel for him, but at the same time there's a point at which you have to if testing guns by shooting them near (but not specifically at) your coworkers is actually a good idea.

coopdog · on Feb 28, 2013

That's the Agile way, people not processes.

When the people screw up, there's no processes to hold them back, they can really, really screw it up.

Once you hit millions in revenue it's probably wise to put a couple of fall back processes in place, reliability becomes as important as agility.

gawker · on Feb 27, 2013

Agree with the point that humans are fallible. We should always have back ups. A company with 1000s of paying customers should at least take steps to protect itself from this sort of catastrophe.

cybernomad99 · on Feb 28, 2013

To the credit of the management, they did not fire him. He resigned. But the coworkers felt he was responsible personally. That makes a uneasy work environment.

ftay · on Feb 28, 2013

That's not much to give credit for. They could certainly have done more to help him recover from this.

jalayr · on Feb 27, 2013

Completely agreed.

columbo · on Feb 27, 2013

News flash,

If you are a CEO you should be asking this question: "How many people in this company can unilaterally destroy our entire business model?"

If you are a CTO you should be asking this question: "How quickly can we recover from a perfect storm?"

They didn't ask those questions, they couldn't take responsibility, they blamed the junior developer. I think I know who the real fuckups are.

As an aside: Way back in time I caused about ten thousand companies to have to refile some pretty important government documents because I was doubling xml decoding (& became &amp;). My boss actually laughed and was like "we should have caught this a long time ago"... by we he actually meant himself and support.

hga · on Feb 27, 2013

If you are a CEO you should be asking this question: "How many people in this company can unilaterally destroy our entire business model?"

In high tech this can get really messy, these are frequently inherently more fragile companies. My favorite example is from Robert X. Cringley in this great book: http://www.amazon.com/Accidental-Empires-Silicon-Millions-Co... ; from memory:

One day Intel's yields suddenly went to hell (that's the ratio of working die on a wafer to non-working, and is a key to profitability). And no matter how hard they tried, they could only narrow it down to the wafers being contaminated, but the wafer supplier swore up and down they were shipping good stuff, and they were. So eventually they tasked a guy to follow packages from the supplier all the way to the fab lines, and he found the problem in Intel's receiving department. Where a clerk was breaking open the sealed packages and counting out the wafers on his desk to make damned sure Intel was getting its money worth....

His point is that you can have a Fortune 500 company, normally thought to be stable companies that won't go "poof" without ample warning, in which there are many more people than in previous kinds of companies who can very quickly kill it dead.

gefh · on Feb 27, 2013

I physically cringed at that. Even the mail clerk should have noticed there was a 'big deal' about clean rooms and had some idea what the company he worked for did...

wpietri · on Feb 27, 2013

We are all born naked, bloody, and screaming; the only thing we know is how to work a nipple. Everything else has to be learned.

One of Toyota's mantras is "If the student has failed to learn, the teacher has failed to teach." Their point is that managers are responsible for solving issues that come from employee ignorance, not line workers.

HeyLaughingBoy · on Feb 27, 2013

Exactly. In my organization I know of perfectly good hardware that is either being tossed or used for non-critical applications because someone didn't follow the Incoming Inspection process correctly. It doesn't matter that they could simply be inspected now and found to be perfect, the process wasn't followed, so the product is "junk."

desas · on March 1, 2013

Babies have to learn how to work a nipple, it's not some innate human knowledge.

wpietri · on March 3, 2013

We may be quibbling over definitions. But do you believe that no mammal has an instinctual understanding of using nipples? Or that humans are unique among mammals in that regard?

desas · on March 4, 2013

I'm afraid I have no idea about the answer to either of your questions. I assumed the "we" in your original post meant "we humans".

I do understand that humans don't have an instinctual understanding of using nipples, human babies have a physical sucking reflex that kicks in when you put anything near their mouths. They usually quickly learn that sucking a nipple in a particular way gives out lots of yummy milk.

geoka9 · on Feb 28, 2013

Where a clerk was breaking open the sealed packages and counting out the wafers on his desk to make damned sure Intel was getting its money worth....

I find this hard to believe. At some point a person in a space suit was introducing them into a clean room; she should have noticed that the packages were not sealed.

pja · on Feb 28, 2013

Maybe the clerk was carefully sealing them back up again? In retrospect they probably should have had tamper-proof seals given the value attached to the wafers being delivered unopened, but then most problems are easily avoidable in retrospect.

hga · on Feb 28, 2013

Exactly; this happened early enough in the history of ICs that I'm sure they weren't taking such precautions, just like tamper revealing seals for over the counter drugs didn't become big until after the Chicago Tylenol murders.

There's a good chance a tamper revealing seal would have stopped the clerk from opening the containers, and of course it they'd been broken before reaching the people at the fab lines who were supposed to open them that would have clued Intel into the problem before any went into production and would have allowed them to quickly trace the problem back upstream.

deservingend · on Feb 28, 2013

Seems like sabotage.

hga · on Feb 28, 2013

"Never attribute to malice that which is adequately explained by stupidity." (http://en.wikipedia.org/wiki/Hanlons_razor)

Or in this case energy, ignorance, and not learning enough of the big picture, or wondering why these wafers were sealed in air tight containers in the first place.

It's the well meaning mistakes that tend to be the most dangerous since most people are of good will, or at least don't want to kill the company and lose their jobs.

novax81 · on Feb 27, 2013

The difference between working with bad bosses and good bosses really shows itself when there's a disaster going on.

A mere few months into my current job, I ran an SQL query that updated some rows without properly checking one of the subqueries. Long story short - I fucked up an entire attribute table on a production machine (client insisted they didn't need a dev clone). I literally just broke down and started panicking, sure that I'd just lost my new job and wouldn't even have a reference to run on for a new one.

After a few minutes of me freaking out, my boss just says: "You fucked up. I've fucked up before, probably worse than this. Let's go fix it." And that was it. We fixed it (made it even better than it was, as we fixed the issue I was working on in the process), and instead of feeling like an outcast who didn't belong, I learned something and became a better dev for it.

zdgman · on Feb 27, 2013

I really wish this was at the top. Everyone will fuck up at some point (even your best engineer). Whether you learn anything from fucking up really determines how bad your mistake was.

ritchiea · on Feb 27, 2013

This is so true. I had to leave my first junior dev position for similar reasons as the OP, though nothing as monumental.

I was handed a legacy codebase with zero tests. I left a few small bugs in production, and got absolutely chewed out for it. It was never an issue with our processes, it was obviously an issue with the guy they hired who had 1 intro CS class and 1 rails hobby project on his resume. The lead dev never really cared that we didn't have tests, or a careful deploy process. He just got angry when things went wrong. And even gave one more dev with even less experience than I had access to the code base.

It was a mess and the only thing I was "learning" was "don't touch any code you don't have to because who knows what you might break" which is obviously a terrible way to learn as a junior dev (forget "move fast and break things" we "move slowly and work in fear!"). So I quit and moved on, it was one of the better decisions I've ever made.

paganel · on Feb 27, 2013

As a programmer I consider myself very lucky that one of the first advices I got when I was a junior was from one of my senior colleagues (and a very smart guy): "one of the most valuable qualities of a good programmer is courage".

Seven and a half years later I make sure that I pass that knowledge on to my junior colleagues. I'm proud to say that just in the past 2 weeks I've said this twice to one of my younger team-mates, a recent hire, "don't be afraid to break things!"

danso · on Feb 27, 2013

Heh. Another way I would say that is that "discretion is the better part of valor" And, to rip-off Hitchhiker's Guide to the Galaxy, "cowardice is the better part of discretion"...in that if fear makes you judiciously check your backups and write tests, then that's not a bad thing at all.

ritchiea · on Feb 27, 2013

I believe developers need the courage to do what we feel is necessary and the paranoia to quadruple check our work and assumptions.

Retric · on Feb 27, 2013

Paranoia is often useful, but with a good enviorment, and tools it's rarely need. I find bad assumptions often cause the worst problems. Break things regularly and you end up with fewer assumptions about the code base / production environment which is a vary good thing.

chris_mahan · on Feb 28, 2013

I don't trust tools unless I wrote them myself, in which case I know there will be bugs in them.

yaj · on Feb 27, 2013

This is the right thing to encourage but I just would like to add always have a backup.

"Don't be afraid to break things as long as you have a backup".

It might be a simple version of the previous code, database copy or even the entire application. Do not forget to backup. If everything fails, we can quickly restore the previous working version.

sixtypoundhound · on Feb 27, 2013

I would go one step farther....

Every production deployment should involve blowing away the prior instance, rebuilding from scratch, and restarting the service; you are effectively doing a near-full "restore" for every deployment, which forces you to have everything fully backed up and accessible...

Any failure to maintain good business continuity practices will manifest early for a product / employee / team, which allows you to prevent larger failures...

mcherm · on Feb 28, 2013

Spoken like a man who has maintained applications but never databases.

In the world where data needs to be maintained, this is not necessarily an option. In the bank where I work, we deploy new code without taking any outage (provide a new set of stored procedures in the database, then deploy a second set of middleware, then a new set of front-end servers, test it, then begin starting new user sessions on the new system; when all old user sessions have completed the old version can be turned off). Taking down the database AT ALL would require user outages. Restoring the database from backup is VERY HARD and would take hours (LOTS of hours).

That being said, we do NOT test our disaster-recovery and restore procedures well enough.

mikec3k · on Feb 27, 2013

Use source control. You can always revert.

ghjm · on Feb 27, 2013

This is correct and is the way it should be. So how come the programmers are always politically gunning for the keys to the production server cabinet, where you do have to be afraid to break things?

ramchip · on Feb 28, 2013

Are they? Where I work (a bank) we were more than happy to move to read-only access to prod servers, and pass by a support team when we need to deploy things.

bcbrown · on Feb 28, 2013

That's where raises and bonuses come from.

Shorel · on Feb 27, 2013

My own rule now is:

The bug is in the part that I though it was so obvious, I missed some check.

PeterisP · on Feb 28, 2013

Hmm, is it courage like "real men test in production environment" ?

smrtinsert · on Feb 27, 2013

slowly and work in fear businesses usually collapse. the technical problem eventually becomes a business problem and no one realizes it can be solved at the source level.

it first builds inefficieny at the technical level then at the business level, finally it causes issues at the cultural level, and thats when the smart people start leaving.

lsc · on Feb 27, 2013

>If you are a CEO you should be asking this question: "How many people in this company can unilaterally destroy our entire business model?"

This is a question that the person in charge of backups needs to think about, too. I mean, rephrase it as "Is there any one person who can write to both production and backup copies of critical data?" but it means the same thing as what you said.

(and if the CTO, or whoever is in charge of backups screws up this question? the 'perfect storm' means "all your data is gone" - dono about you, but my plan for that involves bankruptcy court, and a whole lot of personal shame. Someone coming in and stealing all the hardware? not nearly as big of a deal, as long as I've still got the data. My own 'backup' house is not in order, well, for lots of reasons, mostly having to do with performance, so I live with this low-level fear every day.)

Seriously, think, for a moment. There's at least one kid with root on production /and/ access to the backups, right? At most small companies, that is all your 'root-level' sysadmins.

That's bad. What if his (or her) account credentials get compromised? (or what if they go rogue? it happens. Not often, and usually when it does it's "But this is really best for the company" It's pretty rare that a SysAdmin actively and directly attempts to destroy a company.)

(SysAdmins going fully rogue is pretty rare, but I think it's still a good thought experiment. If there is no way for the user to destroy something when they are actively hostile, you /know/ they can't destroy it by accident. It's the only way to be sure.)

The point of backups, primarily, is to cover your ass when someone screws up, primarily. (RAID, on the other hand, is primarily to cover your ass when hardware fails) - RAID is not Backup and Backup is not RAID. You need to keep this in mind when designing your backup, and when designing your RAID.

(Yes, backup is also nice when the hardware failure gets so bad that RAID can't save you; but you know what? that's pretty goddamn rare, compared to 'someone fucked up.')

I mean, the worst case backup system would be a system that remotely writes all local data off site, without keeping snapshots or some way of reverting. That's not a backup at all; that's a RAID.

The best case backup is some sort of remote backup where you physically can't overwrite the goddamn thing for X days. Traditionally, this is done with off-site tape. I (or rather, your junior sysadmin monkey) writes the backup to tape, then tests the tape, then gives the tape to the iron mountain truck to stick in a safe. (if your company has money; if not, the safe is under the owner's bed.)

I think that with modern snapshots, it would be interesting to create a 'cloud backup' service where you have a 'do not allow overwrite before date X' parameter, and it wouldn't be that hard to implement, but I don't know of anyone that does it. The hard part about doing it in house is that the person who managed the backup server couldn't have root on production and vis-a-vis, or you defeat the point, so this is one case where outsourcing is very likely to be better than anything you could do yourself.

Piskvorrr · on Feb 27, 2013

> If there is no way for the user to destroy something when they are actively hostile, you /know/ they can't destroy it by accident.

Which also means they can't fix something in case of a catastrophic event. "Recover a file deleted from ext3? Fix a borked NTFS partition? Salvage a crashed MySQL table? Sorry boss, no can do - my admin powers have been neutered so that I don't break something 'by accident, wink wink nudge nudge'." This is, ultimately, an issue of trust, not of artificial technical limitations.

> one case where outsourcing is very likely to be better than anything you could do yourself.

Hm. Your idea that "cloud is actually pixie dust magically solving all problems" seems to fail your very own test. Is there a way to prevent the outsourced admins from, um, destroying something when they are actively hostile? Nope, you've only added a layer of indirection.

(also, "rouge" is "#993366", not "sabotage")

lsc · on Feb 27, 2013

>> If there is no way for the user to destroy something when they are actively hostile, you /know/ they can't destroy it by accident.

>Which also means they can't fix something in case of a catastrophic event. "Recover a file deleted from ext3? Fix a borked NTFS partition? Salvage a crashed MySQL table? Sorry boss, no can do - my admin powers have been neutered so that I don't break something 'by accident, wink wink nudge nudge'." This is, ultimately, an issue of trust, not of artificial technical limitations.

All of the problems you describe can be solved by spare hardware and read only access to the backups. I mean, your SysAdmin needs control over the production environment, right? to do his or her job. but a sysadmin can function just fine without being able to overwrite backups. (assuming there is someone else around to admin the backup server.)

fixing my spelling now.

Yes, it's about trust. but anyone who demands absolute trust is, well, at the very least an overconfident asshole. I mean, in a properly designed backup system (and I don't have anything at all like this at the moment) I would not have write-access to the backups, and I'm majority shareholder and lead sysadmin.

That's what I'm saying... backups are primarily there when someone screwed it up... in other words, when someone was trusted (or trusted themselves) too much.

Piskvorrr · on Feb 27, 2013

Okay, now I think I understand you, and it seems we're actually in agreement - there is still absolute power, but it's not all concentrated in one user :)

(that rouge/rogue thing is my pet peeve)

lsc · on Feb 27, 2013

>Hm. Your idea that "cloud is actually pixie dust magically solving all problems" seems to fail your very own test. Is there a way to prevent the outsourced admins from, um, destroying something when they are actively hostile? Nope, you've only added a layer of indirection.

the idea here is to make sure that the people with write-access to production don't have write-access to the backups and vis-a-vis. The point is that now two people have to screw it up before I lose data.

Outsourcing has it's place. You are an idiot if you outsource production and backups to the same people, though. This is why I think "the cloud" is a bad way of thinking about it. Linode and rackspace are completely different companies... one of them screwing it up is not going to effect the other.

joecurry · on Feb 27, 2013

>> I think that with modern snapshots, it would be interesting to create a 'cloud backup' service where you have a 'do not allow overwrite before date X' parameter, and it wouldn't be that hard to implement, but I don't know of anyone that does it.

I test backups for F500 companies on a daily basis (IT Risk Consulting) - this would be missing the point really, the business process around this problem is really moving towards live mirrored replication. This allows much faster recall time, and also mitigates many risks with the conventional 'snapshot' method through either tapes, cloud, etc.

stephengillie · on Feb 27, 2013

I think that with modern snapshots, it would be interesting to create a 'cloud backup' service where you have a 'do not allow overwrite before date X' parameter, and it wouldn't be that hard to implement, but I don't know of anyone that does it.

Does Amazon Glacier offer this?

ricardobeat · on Feb 27, 2013

I think since it uses generated IDs for each archive, it's impossible to overwrite anything.

johngalt · on Feb 27, 2013

From a sysadmin:

Redundancy = Reduce the number of component failures that can lead to system failure (RAID, live replication, hot standby).

Backup = Recover from an obvious failure or overwite (Weekly full backups, daily differentials).

Archival = Recover from a non-obvious failure and/or malicious activity (WORM tapes, offsite backup).

As a failsafe against malicious sysadmins is to split up the responsibilities. The guy handling backups isn't handling archival etc...

pja · on Feb 28, 2013

I've said it elsewhere, but it bears repeating: RAID is about availability first and foremost. The fact that it happens to preserve your data in the case of one form of hardware failure is a side effect of its primary goal.

seivan · on Feb 27, 2013

Well said!

xentronium · on Feb 27, 2013

This is certainly a monumental fuckup, but these things inevitably happen even with better development practices, this is why you need backups, preferably daily, and as much separation of concerns and responsibilities as humanly possible.

Anecdote:

I am working for a company that does some data analysis for marketers aggregated from a vast number of sources. There was a giant legacy MyISAM (this becomes important later) table with lots of imported data. One day, I made some trivial looking migration (added a flag column to that table). I tested it locally, rolled it out to staging server. Everything seemed A-OK until we started migration on the production server. Suddenly, everything broke. By everything, I mean EVERYTHING, our web application showed massive 500-s, total DEFCON1 across the whole company. It turned out we ran out of disk space, since apparently myisam tables are altered the following way: first the table is created with updated schema, then it is populated with data from the old table. MyISAM ran out of disk space and somehow corrupted the existing tables, mysql server would start with blank tables, with all data lost.

I can confirm this very feeling: "The implications of what I'd just done didn't immediately hit me. I first had a truly out-of-body experience, seeming to hover above the darkened room of hackers, each hunched over glowing terminals." Also, I distinctly remember how I shivered and my hands shook. It felt like my body temperature fell by several degrees.

Fortunately for me, there was a daily backup routine in place. Still, several hour long outage and lots of apologies to angry clients.

"There are two types of people in this world, those who have lost data, and those who are going to lose data"

perlgeek · on Feb 27, 2013

Reading those stories makes me realize how well thought-out the process at my work is:

We have dev databases (one of which was recently empty, nobody knows why; but that's another matter), then a staging environment, and finally production. And the database in the staging environment runs on a weaker machine than the prod database. So before any schema change goes into production, we do a time measurement in the staging environment to have a rough upper bound for how long it will take, how much disc space it uses etc.

And we have a monthly sync from prod to staging, so the staging db isn't much smaller than prod db.

And the small team of developers occasionally decides to do a restore of the prod db in the development environment.

The downside is that we can't easily keep sensitive production data to find its way into the development environment.

wpietri · on Feb 27, 2013

When moving data from prod to other environments, consider a scrambler. E.g., replace all customer names with names generated from census data.

I try to keep data having the same form (e.g., length, number of records, similar relationships, looks like production data). But it's random enough so that if the data ever leaks, we don't have to apologize to everybody.

Since your handle is perlgeek, you're already well equipped to do a streaming transformation of your SQL dump. :)

Domenic_S · on Feb 27, 2013

Yep. For x.com I wrote a simple cron job that sterilizes the automated database dump and sends it to the dev server. Roughly, it's like this:

-cp the dump to a new working copy

-sed out cache and tmp tables

-Replace all personal user data with placeholders. This part can be tricky, because you have to find everywhere this lives (are form submissions stored and do they have PII?)

-Some more sed to deal with actions/triggers that are linked to production's db user specifically.

-Finally, scp the sanitized dump to the dev server, where it awaits a Jenkins job to import the new dump.

The cron job happens on the production DB server itself overnight (keeping the PII exposure at the same level it is already), so we don't even have to think about it. We've got a working, sanitized database dump ready and waiting every morning, and a fresh prod-like environment built for us when we log on. It's a beautiful thing.

pc86 · on Feb 28, 2013

This sounds like it'd make a good blog post.

MBCook · on Feb 27, 2013

Ug. MyISAM.

More than once (over the last few years) I've been doing some important update. I tend to do it the same way.

  START TRANSACTION;
  --run SQL--
  --check results--
  COMMIT; or ABORT TRANSACTION;

Of course, if you happen to run into the 1 or 2 MyISAM tables that no one knew were MyISAM, abort doesn't do anything. You've screwed up the data and need a backup.

So you always have to make a backup and check that the tables are defined they way they should be. Nothing is quite as much fun as the first time you delete a bunch of critical data off a production system. Luckily the table was small, basically static, and we had backups so it was only a problem for ~5 minutes.

mst · on Feb 27, 2013

Any CREATE, ALTER or DROP on mysql does an implicit, silent COMMIT (not rollback) before running it.

PostgreSQL can do DDL inside a transaction though.

MBCook · on Feb 27, 2013

Yep, that's another one to watch out for. But I'm used to that, it's when an update or delete that can't be reversed I get surprised.

hef19898 · on Feb 27, 2013

Remebers me of adifferent story from back the day when I worked around a pretty huge SAP system (as some kind of super user whatever). In one, seemingly trivial update (trivial compared to the complete system upgrade from one version to the higher one that worked without problems) cleansed the database including all purchase orders from the last two days company wide. Ah, and the back-up becamo "unusable", too.

But as far as I know, nobody was fired for this. Because, yes, things like this just can happen. An eventually it got fixed anyway.

grey-area · on Feb 27, 2013

Tens of thousands of paying customers and no backups?

No staging environment (from which ad-hoc backups could have been restored)!?!?

No regular testing of backups to ensure they work?

No local backups on dev machines?!?

Using a GUI tool for db management on the live db?!?!?

No migrations!?!?!

Junior devs (or any devs) testing changes on the live db and wiping tables?!?!?!

What an astonishing failure of process. The higher ups are definitely far more responsible than some junior developer for this, he shouldn't have been allowed near the live database in the first place until he was ready to take changes live, and then only on to a staging environment using migrations of some kind which could then be replayed on live.

They need one of these to start with, then some process:

http://www.bnj.com/cowboy-coding-pink-sombrero/

jakejake · on Feb 27, 2013

My hypothesis is that it's a game company and all of the focus was on the game code. The lowly job of maintaining the state server was punted off to the "junior dev" just out of school. Nobody was paying attention. It was something that just ran.

They paid the price of ignoring what was actually the most critical part of their business.

ptaipale · on Feb 28, 2013

I disagree slightly. If you're a game company, your most critical part of the business is the game.

Even if you have a rock-solid database management, backup, auditing etc process, if your game is not playable, you won't have any data that you could lose by having a DB admin mis-click.

Still, not handling your next-most-critical data properly is monumentally stupid and a collective failure of everyone who should have known.

grey-area · on Feb 27, 2013

Sounds convincing. I guess now they'll realise their mistake though as the servers are critical to their business.

narrator · on Feb 27, 2013

The development environment should not be able to make a direct connection to production. GitHub temporarily deleted their whole prod database because of a config screwup because the dev server could talk to the production db. https://github.com/blog/744-today-s-outage

grey-area · on Feb 27, 2013

I wasn't suggesting it should, ideally they'd be on completely isolated machines, and there's no reason it has to connect to production. Just because you use production backups to set up your dev environment, doesn't mean the dev environment should be able to talk to production servers, quite the opposite.

What I'd normally do is have a production server, which has daily backups, copies of which are used for dev on local machines, and then pushed to a dev server with separate dev db which is wiped periodically with that production data (a useful test of restoring backups), and has no connection with the production server or db.

Can't work out why they would possibly be doing development on a live db like this, that's insanity.

hugoc · on Feb 28, 2013

I worked for the largest cellphone carrier in my country. I had write permission for the production db (not to all views, though) from my second week there onwards, the first week I used the credentials of the guy training me. The guy training knew the whole thing was wrong, mainly because once he ran a query that froze the db for half day. I was not a developer I was working in the help desk.

Domenic_S · on Feb 27, 2013

> Using a GUI tool for db management on the live db?!?!?

I still use the mysql CLI and have for 10 years plus-or-minus, but I actually use Sequel Pro a lot. If I'm perusing tables with millions of rows, or I want to quickly see a schema, or whatever, it's been a net gain in productivity.

http://www.sequelpro.com/

Kiro · on Feb 27, 2013

What do you mean by "no migrations"?

beilabs · on Feb 27, 2013

Think of your database in terms of code version control. You want to track the changes that are made to it, adding a column could be a migration, renaming a column could be another.

A migration enables you to track the changes you made and possibly rollback to previous migrations (database states) if ever required.

juan_juarez · on Feb 27, 2013

Hell - this could have been avoided if they weren't using graphical tools or had a database that used transactions.

krallja · on Feb 27, 2013

Neither of those things would have fixed this problem.

1. "Oops, I wrote TRUNCATE TABLE User instead of TRUNCATE TABLE Raids"

2. Transaction complete. ... ... "oops!"

whatshisface · on Feb 27, 2013

Well, to be fair it's harder to type User than it is to miss your rightclick by something that could have only been a few pixels.

derekp7 · on Feb 28, 2013

Also, I can't tell you how many times I've attempted to click on a button on a web form, and it was still loading and the button moved (along with a different button appearing in its place).

themonk · on Feb 27, 2013

Why type "TRUNCATE TABLE Raids" every time? Type it once, test it, save as script, schedule that script.

krallja · on Feb 28, 2013

That is definitely another bug in this company's process. This should have been automated. It's much harder to screw up running a script!

ptaipale · on Feb 28, 2013

Not really. You obviously haven't (yet) done anything like update t1 set status=0; where status=4; when you wanted to release (set status to 0) objects that are stuck in state 4, and let all other objects keep their existing statuses.

This is an easy mistake to make on command line. I hate GUIs too but not having one doesn't really help when your fundamental operating model is wrong.

devicenull · on Feb 28, 2013

And that's why you use --i-am-a-dummy on the command line :)

http://dev.mysql.com/doc/refman/4.1/en/mysql-command-options...

cmos · on Feb 27, 2013

When I was 18 I took out half my towns power for 30 minutes with a bad scada command. It was my summer job before college and I went from cleaning the warehouse to programming the main SCADA control system in a couple weeks.

Alarms went off, people came running in freaking out, trucks started rolling out to survey the damage, hospitals started calling about people on life support and how the backup generators were not operational, old people started calling about how they require AC to stay alive and should they take their loved ones on machines to the hospital soon.

My boss was pretty chill about it. "Now you know not to do that" were his words of wisdom, and I continued programming the system for the next 4 summers with no real mistakes.

gnarbarian · on Feb 28, 2013

I'm interested in knowing some more details regarding the architectural setup and organizational structure that would allow something like this to happen.

cmos · on March 2, 2013

It was the small New England town I grew up in outside of Boston. Typically they manage their own power distribution, and purchase power from a utility at a rate fixed by the highest peak usage of the year, which tends to be 1:00 sometime in early August.

I was setting up a system to detect the peak usage and alert the attendant to send a fax (1992) to our local college generator plant to switch over some of their power to us, thus reducing our yearly power bill by a lot.

There was a title grabber in charge of the department, and a highly competent engineer running everything and keeping the crews going. That was my boss. Each summer I was doing new and crazy things, from house calls for bad TV signals from grounding to mapping out the entire grid. Oddly enough there was no documentation as to what house was on what circuit.. it was in the heads of the old linesman. Most of the time when they got a call about power outages they drove around looking at what houses were dark and what were not.

Sometimes the older linesmen would call me on the radio to meet them somewhere, and we'd end up in their back yards taking a break having some beers. I learned a lot from those old curmudgeonly linesmen. They made fun of me non stop, but always with a wink and roaring laughter. Especially when I cut power to half the town.

sophacles · on Feb 28, 2013

Honestly, you don't. The IT engineering in power and other SCADA systems is downright scary.

gnarbarian · on Feb 28, 2013

sounds like an opportunity to me.

sophacles · on Feb 28, 2013

It's a hard space to break into. The businesses are conservative about any change, and new vendors are fairly un-trusted. Further, it is a bit different of a world than normal software realms, due to crazy long legacy lifetimes. Another problem is that the money isn't as big as you'd think, its a surprisingly small field.

All that being said, there is opportunity, just not easy opportunity. And a huge number of the people in it are boomers, so there is going to be big shake-ups in the next decade or two.

cedsav · on Feb 27, 2013

Whoever was your boss should have taken responsibility. Someone gave you access to the production database instead of setting up a proper development and testing environment. For a company doing "millions" in revenues, it's odd that they wouldn't think of getting someone with a tiny bit of experience to manage the development team.

tseabrooks · on Feb 27, 2013

We sell middleware to a number of customers with millions of dollars in revenue who don't have backups, don't have testbeds for rolling out to "dev" before pushing to "prod" and don't have someone with any expertise in managing their IT / infrastructure needs.

My experience is that this is the norm, not the exception.

Piskvorrr · on Feb 27, 2013

Then perhaps "insane" is the word we're looking for, not "exceptional." Lack of sanity can be the norm, but that still doesn't make it desirable.

jzelinskie · on Feb 27, 2013

As a college student, this fucking horrifies me. Is there anyway I can guarantee I don't end up at someplace as unprofessional as this? I want to learn my first job not teach/lead.

potatolicious · on Feb 27, 2013

The interview advice here is excellent. Ask questions - in the current climate they're hunting you, not the other way around.

Additionally, start networking now. Get to know ace developers in your area, and you will start hearing about top-level development shops. Go to meetups and other events where strong developers are likely to gather (or really, developers who give a shit about proper engineering) and meet people there.

It's next to impossible to know, walking into an office building, whether the company is a fucked up joke or good at what it does - people will tell you.

X-Istence · on Feb 27, 2013

An interview is a two way street. You too can ask questions, make sure you fit within the team, and that the job is to your liking. That is the time to ask and find out about anything that may be pertinent to your job.

Also, you have a choice of leaving if you don't like the job, and or don't find the practices in place to be any good, or you can fix them.

hglaser · on Feb 27, 2013

The Joel Test stands the test of time: http://www.joelonsoftware.com/articles/fog0000000043.html

It doesn't cover every last thing, but a team following these practices is the kind of team you're looking for. Ask these questions at your interviews.

protomyth · on Feb 27, 2013

Ask about their development and staging process during the technical interview. Ask about how someone gets a piece of code into production (listen for mention of different environments).

wpietri · on Feb 27, 2013

I think it's entirely fair to haul out the Joel Test and say you'd just like to hear how they handle those things.

I honestly think the test is a little out of date, but if they say, "Well, instead of X we're doing better thing Y", that's a great answer.

jakejake · on Feb 27, 2013

Well you should certainly look for a team that gives you a good impression with their dev process.

But, things are not always perfect, even on great teams. Not saying its normal to destroy your production database! But even in good shops it's a constant challenge to stay organized and do great work. Look for a team that is at least trying to do great work, rather than a complacent team.

alttab · on Feb 27, 2013

Normal = OK.

wyqueshocec · on Feb 27, 2013

This. Because of this and the lack of backups.

lucian1900 · on Feb 27, 2013

Even worse, they didn't have a backup.

mootothemax · on Feb 27, 2013

The CEO leaned across the table, got in my face, and said, "this, is a monumental fuck up. You're gonna cost us millions in revenue".

No, the CEO was at fault, as was whoever let you develop against the production database.

If the CEO had any sense, he should have put you in charge of fixing the issue and then making sure it could never happen again. Taking things further, they could have asked you to find other worrying areas, and come up with fixes for those before something else bad happens.

I have no doubt that you would have taken the task extremely seriously, and the company would have ended up in a better place.

Instead, they're down an employee, and the remaining employees know that if they make a mistake, they'll be out of the door.

And they still have an empty users table.

j_baker · on Feb 27, 2013

To be fair, if the CEO were willing to take those steps, the company would probably not have a deleted USERS table.