The public report is nice and we can see a sequence of mishaps from it, that shouldn't have been allowed to happen but which (unfortunately) are not that uncommon. I've made my share of mistakes, I know what it's like to be in emergency mode and too tired to think straight, so I'm going to refrain from criticizing individual actions.
What I'm going to criticize is the excess of transparency:
You absolutely DO NOT publish postmortems referencing actions by NAMED individuals, EVER.
From reading the whole report it's clear that the group is at fault, not a single individual. But most people won't read the whole thing, and even fewer people will try to understand the whole picture. That's why failures are always attributed publicly to the whole team, and actions by individuals are handled internally only.
And they're making it even worse by livestreaming the thing! It's like having your boss looking over your shoulder but a million times worse...
I myself initially added my name to the document in various parts, this was later changed to just initials. I specifically told my colleagues it was OK to keep it in the document. I have no problems taking responsibility for mistakes, and making sure they don't happen ever again.
I think perhaps you don't want to do this in the future.
Incident reports are about focusing on the "what" and "when" not the "who". This is not about taking responsibility (you don't need to be published on the internet to do that) and you can always have a follow up post after the incident report has been published as a "what I learned during incident X".
While it's great you're OK with publishing your name, you've now set a precedent that says it's OK to do this to other developers. A blanket policy on keeping names out of the incident report protects others who may not be as willing to get their name on HN (as well as not having to make amendments or retractions if the initial assumptions are incorrect). It also keeps a sense of professionalism as it's clear that no blame is being assigned.
I know that you guys are not assigning blame, but if I was to show this to someone outside of this discussion, they'd assume that it was a fingerpointing exercise, which does not reflect well on Gitlab.
I think you're blowing it out of proportion. If you showed it to someone and they told you they assumed it was about fingerpo...Look, it's not that big a deal. They decided to do it, not everything is a blame game.
Woah there! I think you may have misread the parent as it looks like some friendly advice to me (with actual reasons and stuff), rather than the "You shouldn't have done that! You've destroyed your company!!!!" you seem to have read it as.
Heck, they didn't even say to retract anything from the report, just maybe to leave adding names to things until a later date in future incidents.
"It also keeps a sense of professionalism as it's clear that no blame is being assigned" is not the same as "you guys acted unprofessionally!". It's letting the GitLab guys know there's a potential problem with the communication style at that point in the story.
I find it funny that in a comments section full of comments about allowing a frank learning experience you're being so down on someone giving tips to consider learning from.
That's awesome, but why publicize it? This isn't an act of contrition for you, no one outside your team really needs to see your dirty laundry, and actually comes off as unprofessional to me. The gitlab team is a team, and you take responsibility as a team. Placing names and initials in the liveblog makes it look like SOMEONE is trying to assign and pass off blame, even if that is not what is happening.
Presumably in the coming days there will be a number of team meetings where you discover what went wrong, and what the action items are for everyone moving forward. The public looking info just needs to say what went wrong, how it is being fixed, and what will be done in the future to prevent it from happening again. I don't need names to get that.
On the contrary, it comes off as very professional. All other companies would hide this, they would show off a very cleaned up post-mortem and say "problem solved" and that's it. Ok so what does that mean, does it mean the process will change for the future or that they just fixed it for today?
This is also an awesome advert to see how they work remotely all together and I'm sure they're hiring for DevOps people now ;)
There are quite a few comments on this very thread about how this creates trust, not ruins it.
Thinking that there's a "right" and a "wrong" way (or an "unprofessional"/"dirty laundry" situation) is quite contrary to the Gitlab transparency model and their culture. If you don't like their culture, don't work there.
IMO the idea of secrecy equating to professionalism is _the_ problem with many things. "Information wants to be free." It's also more personable, especially to those who use their product - to me, it shows they're on top of it, they care and are taking responsibility. Gives you a sense like you're part of the team (or they part of yours).
Keeping names out of incident reports isn't about secrecy. There's nothing stopping the folks at Gitlab posting up a retrospective blog post. Incident reports are formal documents published to let users and customers know what's going on. The names can come later, if all parties are OK with it.
You're assuming the only reason for wanting to do it would be as contrition, but it sounds like that's not the reason here. Possibly the GitLab team cares about transparency to the extent that they simply prefer to be transparent.
I don't know why people on hacker news are against transparency. I'm glad you guys are live streaming this, others would feel too inadequate to do so. Being this transparent only makes me want to use (and contribute to) gitlab even more.
I'm guessing they feel strongly about getting singled out if something like this would happen to them. Possibly because they have been used as a scapegoat by an employer or teammate once.
> I don't know why people on hacker news are against transparency.
> Being this transparent only makes me want to use (and contribute to) gitlab even more.
I hope you'll be there when someone doesn't hire the person responsible for their mistake so you can vouch for them.
You don't have radical transparency because the world is not the understanding meritocracy you think it is. There is no value to the employee for having radical transparency in a post mortem.
That's courageous but this wasn't your mistake, it was the CEO's mistake. They owe you a vacation and apology for putting you in that terrible position!
I'm not sure if it's the CEO's mistake, or any specific individual's mistake for that matter. In this particular case many different problems coalesced, producing one giant problem. If it wasn't me, somebody else would have eventually made the same mistake; perhaps with even greater impact.
Any blame that would be generalized to the company as a whole is also specifically the CEO's fault. The buck has to stop somewhere. That is part of the deal for the big chair/title/paycheck/expense account (at any company, not just Gitlab or in SV).
Exactly. The CEO clearly didn't properly weigh the impact of a failure like this on the reputation of the company. GitLab will lose many customers over this. Unfortunately choices like calling in (or allowing) the original engineer to work on recovery, not protecting the engineer from mistakes of the organization, etc. don't signal recognition of the required corrective action. That corrective action requires the CEO to take full responsibility, protect individual employees from process-driven failure, study the cultural aspects that allowed the failure to occur, etc.
I agree with the CEO being responsible. As I mentioned 15 hours ago in this HN post in https://news.ycombinator.com/item?id=13537245 "the blame is with all of us, starting with me. Not with the person in the arena."
I think you're right that it wasn't a mistake of anyone below the CEO position, but I'm certain that it was a huge mistake on the CEO's part. The customers and employees deserve a huge apology from the CEO. I'll be shocked if without this realization on the leader's part the board doesn't replace them.
Anybody whose opinion matters understands that this type of event is a process problem, not a person problem.
GitLab has always blazed their own trail with their transparency, whether through their open run books, open source code, or in this case their open problem resolution. Kudos to them in whatever manner they want to do it in (with or without names).
To be honest, through all of the comments, yours seems the most high-strung, and you're the one complaining about high-pressure situations like having your boss looking over your shoulder. Relax buddy. :)
In a few years the guy doing the `rm -rf` is going to be on a job interview and someone will recall bits of this report. Enough bits to remember the guy, not enough bits to remember that it wasn't his (individual) fault.
Transparency doesn't mean publicly throwing people under the bus.
Honestly, if I were interviewing the guy, that would almost be a bonus! Like, everyone makes mistakes, we're all human, but I can guarantee you that THAT person will never make that particular mistake ever again. And he's going to be 10 times more diligent than the average engineer in making sure there are good backup/restore procedures.
There's a probably apocryphal story like this about a guy forgetting to refuel a plane. The pilot made sure that guy was solely responsible for refuelling his plane in future, because he knew he'd never forget again.
I run backups on my computer before installing new software/fiddling with important settings/etc. because I've fucked up before.
I'll run backups of phones (or at least verify that they are present) before trying to fix issues on them after nuking my mom's phone which resulted in her losing pictures of my niece and nephew. (Luckily she had sent a lot of those pictures to us via e-mail, but still).
I remember reading the story in "How to Win Friends and Influence People". The plane was filled with jet fuel and had to do an emergency landing, but the pilot was able to save everyone. When he got back he told the mechanic that he wanted him to fill his plane the next day because he knew that he'd never make the same mistake again.
If the pilot can forgive someone for a mistake that almost cost lives, I'm sure any good interviewer can forgive him for a mistake that cost data and will probably never be repeated.
I've heard this anecdote before and it never sat well with me. Forgetting to fuel a plane as a plane mechanic exposes a serious character flaw that could lead to something devastating if allowed to continue (perhaps next time he forgets to oil the engine? Grease the brakes?). Sensationalizing this story could actually do a lot of harm. The plane mechanic should have been fired for failing such an important task. If he showed incredible remorse and was responsible enough to own up to his mistakes, he should still have been stripped of all his other responsibilities and only fuel planes until he has proven himself enough to take on more responsibilities again.
When people are afraid to lose their jobs if they make an error you can be pretty sure they will do everything in their power to hide the fact that they made an error, which is the exact opposite of the behavior you want. To allow process improvements it must be absolutely clear that errors will not be punished, but used to help everyone to learn.
The Captain basically got up before the NTSB and when asked what happened, he responded "I F__ked Up!" instead of trying to deflect blame onto an unforeseen system glitch or other excuse. It's since been known as the "Asoh Defense".
They also have the NASA ASRS for reporting near misses, and incidents without fear of FAA enforcement.
It must be coupled with processes that guard against errors though. Defense in depth. I'd imagine the pilot has a tick sheet to go over before takeoff and fuel is an item on that sheet.
I think you highly underestimate the number of mistakes like this on the flight line.
By an order of magnitude it sounds like from your comment. Even if you get 99.99% reliability (good luck with humans involved) think of the number of flight movements per day multiplied by the number of tasks that must be completed.
This is why there are redundant checks and checklists and systems in place. To catch human errors, as absolutely everyone in the business will eventually make a trivial yet critical mistake.
Demanding individual human perfection is great, but you'll find you will end up with no workforce.
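To put rough numbers on the 99.99% point above, here is a back-of-the-envelope sketch. The flight and task counts are made-up placeholders, not real aviation statistics, and it assumes every task fails independently with the same probability.

```python
def expected_errors_per_day(reliability, flights_per_day, tasks_per_flight):
    """Expected number of botched tasks per day.

    Assumes each task is independent and succeeds with probability
    `reliability` (a simplification, real errors tend to cluster).
    """
    return (1.0 - reliability) * flights_per_day * tasks_per_flight


# Hypothetical inputs: 100,000 flight movements a day, 50 checklist-type
# tasks per movement, each done correctly 99.99% of the time.
errors = expected_errors_per_day(0.9999, 100_000, 50)
print(f"expected human errors per day: {errors:.0f}")
```

With those placeholder inputs the expected count comes out at about 500 slips a day, which is the case for independent checks catching them rather than expecting nobody to slip.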
The aviation industry recognizes and accepts that people make mistakes, and that this is a simple fact of being human. Firing that mechanic without fixing the process wouldn't have done any good in the long run. Someone else would just make the same mistake. Maybe not the following week, maybe not the following month, but eventually, it would've happened again. The right answer is to fix the process.
Agreed. Case in point, the recent death of nearly the entire Chapecoense football team:
> According to the preliminary report, several decisions of the flight crew were incompatible with aviation regulations and rendered the flight unsafe. Insufficient flight planning (disregarding necessary fuel stops) and not declaring an emergency when the fuel neared exhaustion caused the crash.
That guy is going to be interviewing at some company with someone who's obsessive enough about outage reports to remember a then-obscure one years later, but enough of an idiot to not understand that people aren't personally to blame for this sort of stuff?
Sounds like even in that very contrived scenario the guy involved would dodge a bullet in not being hired by a bunch of idiots.
also, maybe some people on here are perfect, but if you've used Unix for more than half your life (as I have) you've 'rm -rf'-ed some stuff.
I think people who've been through disasters have a much better understanding of the importance and methods of not ending up there than those with a perfectly clean record.
IOW, I'd hire the "rm -rf" guy first if he owns it.
Years ago I worked for a University. We lost power in our data centre. No big deal right. Stuff comes back up, you realize which service dependencies you missed, set them to run at startup, change some VM dependency startup order and you're good.
One of the SAN arrays didn't come up, and then started rebuilding itself. Our storage was one of those multi-million dollar contracts from IBM. They flew a guy out to the University and after a lot of work, they said the array was lost and unrecoverable.
Backups for production for some VMs were on virtual tape .. on the same shelves as production. O_o
At least a lot of our clusters were split between racks, so in many cases we could just clone another one. We learned that MS BizSpark, in a cluster, only puts the private key on half the machines. We had to recreate a bunch of BizSpark jobs based off what we could still see in the database and our old notes and password vaults. We had been planning on upgrading to a newer version of BizSpark on a Server 2012 (it was on 2003), so this kinda forced us to. Shortly afterwards we learned how to make powershell scripts to backup those jobs and test the backups by redeploying them to lower environments.
The sys admin over the backups was looking for a new job. You can't really fire people from universities easily, because it's very difficult to find IT staff that will take university wages. Word was out though, if he didn't find new work, he was going to be let go. Not laid off, made redundant, or have his position removed. He would be fired.
You don't want to work for a company that has that attitude anyway, honestly. That shows they have a poor attitude towards problems and probably will overreact to things like missing deadlines or pursuing a solution that ends up not working, etc.
You'll never know if it was a great company with a bad interviewer. It's better to use all the advantages you can to get through an interview and get the job to see for yourself. I don't think you can learn anything definitive from most interviews - they're mostly subjective, unscientific voodoo.
knowing exactly how a potential employee handles an error he might have caused? This guy is going to be fighting off job offers, if he hasn't already been.
Good! I'd like to talk about what the engineer learned from the experience. Certainly if trawling through someone's public repos and records turns up a pattern of repeated mistakes, that should be considered - but the mistakes we all make from time to time are chances to learn.
So what I'd be interested in seeing is if the candidate did learn. The mistake is less important than the candidate demonstrating they moved past it as a stronger developer.
On the flip side - given a choice in situation, I'd prefer not to work for a place that dredges up my old bugs and uses them in isolation as a basis for their decision. That suggests the kind of environment I wouldn't enjoy being in.
> Anybody whose opinion matters understands that this type of event is a process problem, not a person problem.
That's how the world should be. Not how it is.
Yes, someone with hiring/firing ability might blame the individual, and you could claim "Oh, you shouldn't listen to them, they're an idiot". But that's not much comfort if you're out of a job and gonna be kicked out of your house. In that situation, the idiot with hiring/firing power matters to your life a lot.
I completely agree. Trying to get low level details out to the public while in the heat of the issue is a misstep; you can still be transparent while not risking over communication that could haunt you later.
While I think most of the HN audience understands that some days you have a bad day and that sometimes very small actions, like typing a single command at a prompt, can have dramatic consequences in technology, there are nonetheless less enlightened souls in hiring positions that simply might find fragments of this in a search on the name when that time comes.
Being too transparent could also encourage legal problems, too, if someone decides that they had a material loss over this, at least for the company. Terms of service or likelihood of a challenge prevailing doesn't necessarily matter: you can be sued for any reason and since there's no loser pay provision in any U.S. jurisdiction that I know of, even a win in court could be very costly. Being overly transparent in a case like this can bolster a claim of gross negligence (justified or not) and the law/courts/judges/juries cannot be relied upon to be consistently rational or properly informed.
Part of the problem is that this isn't actually a postmortem: they're basically live blogging/streaming in real time. What would be helpful for us (users) and them (GitLab) in terms of real-time transparency:
* Acknowledge there was a problem during maintenance and data may/may not have been lost.
* If some data is known to be safe: what data that is.
* What stage are we at. Still figuring it out? Waiting for backups to restore? Verification?
* Broad -estimated- time to recovery: make clear it's a guess. Even coarsely: days away, 10's of hours away, etc.
* When to expect the next public update on the issue.
None of this needs to be very detailed and likely shouldn't include actual technical detail. It just needs to be honest, forthright, and timely. That meets the transparency test while also protecting employees and the company.
Later, when there is time for thoughtful consideration, a technical postmortem at a good level of detail is completely appropriate.
Modern companies tend to have people in roles and teams. What I've done in postmortems is to use role names and team names, not person's names. Even if the team is just one person. This helps keep it about the team and the process. We're all professionals doing our best and striving for continuous improvement.
The person who made the error is just the straw that broke the camel's back. I'm sure these folks knew that they needed to prioritize their backups but other things kept getting in the way. You don't throw people under the bus.
Am I missing something? Where in this report are any individuals actually named? My understanding was that they're using initials in place of names specifically because they want to _avoid_ naming anyone.
The original versions of the document had names. Those were later replaced with initials.
I think the issue was in part that this document didn't appear to be a public "here's what's going on doc" as much as it was a doc they seemed to be using as a focal point for their own coordination efforts.
I'm a huge Gitlab fan. But I long ago lost faith in their ability to run a production service at scale.
Nothing important of mine is allowed to live exclusively on Gitlab.com.
It seems like they are just growing too fast for their level of investment in their production environment.
One of the only reasons I was comfortable using Gitlab.com in the first place was because I knew I could migrate off it without too much disruption if I needed to (yay open source!). Which I ended up forced to do on short notice when their CI system became unusable for people who use their own runners (overloaded system + an architecture which uses a database as a queue. ouch.).
Which put an end to what seemed like constant performance issues. It was overdue, and made me sleep well about things like backups :).
A while back one of their database clusters went into split brain mode, which I could tell as an outsider pretty quickly... but for those on the inside, it took them a while before they figured it out. My tweet on the subject ended up helping document when the problem had started.
If they are going to continue offering Gitlab.com I think they need to seriously invest in their talent. Even with highly skilled folks doing things efficiently, at some point you just need more people to keep up with all the things that need to be done. I know it's a hard skillset to recruit for - us devopish types are both quite costly and quite rare - but I think operating the service as they do today seriously tarnishes the Gitlab brand.
I don't like writing things like this because I know it can be hard to hear/demoralizing. But it's genuine feedback that, taken in the kind spirit in which it is intended, will hopefully be helpful to the Gitlab team.
Hey Daniel, I want to thank you for your candid feedback. Rest assured that this sort of thing makes it back to the team and is truly appreciated no matter how harsh it is.
You're absolutely right -- we need to do better. We're aware of several issues related to the .com service, mostly focused on reliability and speed, and have prioritized these issues this quarter. The site is down so I can't link directly, but here's a link to a cached version of the issue where we're discussing all of this if you'd like to chime in once things are back up: https://webcache.googleusercontent.com/search?q=cache:YgzBJm...
I'm running a remote-only company and we moved to GitLab.com last summer from cloud hosted trac+git/svn combo (xp-dev). The reason we picked GitLab.com was because the stack is awesome and Trac is showing its age. We also wanted a solution that could be run on premises if needed. We spent about a month migrating stuff over to GitLab from Trac. Once we were settled the reliability issues started to show. We were hoping that these would be quickly sorted out given the fact that the pace of the development with the UI and features was quite speedy.
A sales rep reached out and I told him we would be happy to pay if that's required to be able to use the cloud hosted version reliably but I got no response. Certainly we could host GitLab EE or CE on our own but this is what we wanted to avoid and leave it to those who know it best.
xp-dev never ever had any downtime longer than 10 minutes that we actively used during the last 6 years. I'm still paying them so that I can search older projects as the response time is instant while gitlab takes more than 10 seconds to search.
Besides the slow response times and frequent 500 and timeout errors that we got accustomed to, gitlab.com displays the notorious "Deploy in progress" message every other day for over 20-30 minutes preventing us from working. I really hoped that 6-7 months would be enough time to sort these problems out but it only seems to be worsening and this incident kinda makes it more apparent that there are more serious architectural issues, i.e. the whole thing running on one single postgresql instance that can't be restored in one day.
We have one gitlab issue on gitlab.com to create automated backups of all our projects so that we could migrate to our own hosted instance (or perhaps github) but afair gitlab.com does not support exporting the issues. This currently locks us into gitlab.com.
On one hand I'm grateful to you guys because of the great service as we haven't paid a penny, on the other hand I feel that it was a big mistake picking gitlab.com since we could be paying GitHub and be productive instead of watching your twitter feed for a day waiting for the PostgreSQL database to be restored. If anyone can offer a paid hosted gitlab service that we could escape to, I'd be curious to hear about it.
Meant to mention this earlier: Gitlab self-hosted actually has a built-in importer to import projects from Gitlab.com - including issues.
It's mostly worked reliably in my experience (it's only failed to import one project across the various times I've used it, and I didn't bother debugging because for that import we really only needed the git data).
I'm a bit curious here. Do you think that your issues with scalability and reliability have to do with your tech choice (I think it was Ruby on Rails)? Don't want to bash Rails, I'm just genuinely curious, since I come from a Rails background as well and have seen issues similar to yours in the past.
It's not just the tech stack, but a combination of the technical choices made and with the human procedures behind them. We're actively pushing towards getting everybody to focus on scalability, but there's still a lot of debt to take care of.
Just looking at their gemfile is rather telling: a couple hundred gems. I've always felt that if you're going above 100, you should carefully consider how much your codebase is trying to achieve.
They're probably at the point where they really want to think about splitting off of their monolith codebase and into microservices.
Maybe it's because I'm familiar with almost all of the gems, but I don't see anything wrong with their Gemfile. It's a pretty complex project, and they really do have a ton of integrations and features that need those gems.
There's probably a few small libraries that they could have rewritten in a few files (never a few lines), but what's the point? The version is locked, and code can always be forked if they need to make changes (or contribute fixes).
You'd be surprised what you can do by carefully considering what the desired outcome actually needs to be.
Maybe there is justification for all the gems in gitlab's Gemfile, I didn't go through it with a fine tooth comb - but this reaffirms my experience that complex projects outgrow monolith codebases. Having an infrastructure outage take down your entire business is kind of a symptom of that.
> I've always felt that if you're going above 100, you should carefully consider how much your codebase is trying to achieve.
This is a mindset issue. Some communities reject NIH so strongly that you get the opposite problem that everything depends on hundreds of different developers. Gitlab can start some library forks with more stuff integrated, or change communities. Microservices is something that can't help, as all the dependencies will stay just where they are (Gitlab is already uncoupled to some extent).
But, anyway, most of those are stable¹, and I doubt many of Gitlab problems come from dependencies.
1 - They are unbelievably stable for somebody coming from the Python world. When I first installed Gitlab, I couldn't believe how easy it was to get a compatible set of versions.
I see the opposite of NIH especially in the RoR/Ruby world and I don't think it's always a good thing. Developers reach for a library for one piece of functionality in a discrete area of the codebase when they could have achieved the same functionality with a few lines of code. That's not automatically NIH, that's being pragmatic about the dependencies you're bringing in and are going to need to support moving forward.
It is fairly large, but I still find it more organized than some examples I've seen.
Also, I don't see another very common issue with big gemfiles in that they don't seem to have multiple solutions of one thing in there (ie multiple REST clients, DB mockers, etc).
I've considered setting up gitlab locally, and have a couple of students that are trying to set it up on a vps. Customizing their bundle installer is... an interesting learning experience in managing complex * nix servers.
I think it's telling that their standard offering/suggestion for self-hosters is as complex as it is. While on the one hand I applaud the poor soul that maintains the script that tries to orchestrate five(?) services on a general, random, unix/linux server without any knowledge/assumption on what other things are running there -- it unsurprisingly falls over in "interesting" ways when you try to do radical stuff like install it on a server that runs another copy of nginx with various vhosts etc.
Now, running services like gitlab at "Internet scale" is far from trivial - but running it at "office scale" should be.
I fully understand how gitlab ended up where they are - but ideally, the self-host version should just need to be pointed at a postgresql instance, and be more or less a "gem install gitlab" -- or similar away - popping up with some ruby web-server on a high port on localhost -- and come with a five-line "sites"-config for nginx and apache for setting up a proxy.
I really don't mean to complain - it's great that they try to provide an install that is "production ready" -- but if the installer reflects the spirit of how they manage nodes on the gitlab.com side -- I'm surprised they manage to do any updates at all with little down-time...
For now I'm running gogs - and it seems to be more of a "devops" developed package - where deployment/life-cycle has been part of the design/development from the start. Single binary, single configuration file. Easily slips in behind and plays well with simple http proxy setups.
At some point I'll find a day or two to migrate our small install to gitlab (we could use the end-user usability and features) -- but I know I'll need to have some time for it. Time to migrate, time to test the install, time to test disaster-recovery/reinstall from backup... all those steps are slowed down and become more complex when the stack is complex.
(I'll probably end up letting gitlab have a dedicated lxc container, although I'll probably at least try to figure out how to reliably use an external postgres db -- it pains me to "bundle" a full fledged RDBMS. These things are the original "service daemons", along with network attached storage and auth/authz (LDAP/AD etc)).
It might be. I'm not saying it's impossible to scale Rails. It's just very, very hard. Github can do this, because they probably get the best of the best engineers. They even used to have their own, patched Ruby version.
Why do you question Rails while the entire report is about Postgres only?
And as someone working on one of the biggest and oldest Rails codebases out there, I can tell you that in terms of scaling, Rails is the least of our concerns.
Sure it's not as efficient, so it's gonna cost you more in CPU and RAM, but it's trivial to scale horizontally. The real worry is the databases; they are fundamentally harder to scale without tradeoffs.
As for the patched Ruby, we used to have one too (but our patches landed upstream so now we run vanilla). It's not about allowing to scale at all. It's simply that once you reach a certain scale, it's profitable to pay a few engineers to improve Ruby's efficiency. If you have 500 app servers, a 1 or 2% performance gain will save enough to pay those engineers' salaries.
Depending on hundreds of gems means you are depending on the decisions of hundreds of developers with packages which are in constant churn.
Apps like Gitlab and Discourse that depend on hundreds of gems and require end users to have complex build environment and compile software are I think operating a broken user hostile model.
The potential for compilation failures, version mismatches and Ruby oddities like RVM is so gigantic with hundreds of man hours wasted one is left to conclude they may actually want to run a hosting business and not have users deploy themselves.
Compare that to Go or even PHP where things are so orders of magnitude simpler that it is not even the same thing. To deal with this complexity you now have containers but have you solved the complexity or added another layer of complexity? There are technical but I think also social factors at play here.
I don't think it's that. GitLab IS a complex setup and Rails is not helping to make it simple. There is a ton going on in the stack and the company only has limited resources.
It's not hard to scale a Rails server, when compared to other frameworks and languages. It's exactly the same as scaling a server written in Java, Node.js, Python, or any other language. You just spin up more machines and put them behind a load balancer.
Yes, Ruby is slower than other programming languages, but this usually doesn't matter. If you are charging people to use your software, or even if you are serving ads, you will always be making money before you need a second server. Plus, Rails is super productive, so you'll be able to build your product much faster.
I'm not sure why GitHub used a patched Ruby version, but no, that's not necessary.
Having said all of that, I'm moving towards Elixir and Phoenix. Not just because of the performance, but also because I really like the language and framework.
I have searched the gitlab website and repositories looking for processes and procedures addressing change management, release management, incident management or really anything. I have found work instructions but no processes or procedures. Until you develop and enforce some appropriate processes and the resulting procedures I'm afraid you will never be able to deliver and maintain an enterprise level service.
Hopefully this will be the learning experience which allows you to place an emphasis on these things going forward and don't fall into the trap of thinking formal processes and procedures are somehow incongruent with speedy time to market, technological innovation or in conflict with DevOps.
Like you, I would like to add my 2 cents, which I hope will be taken positively, as I would like to see them provide healthy competition for GitHub for years to come.
Since GitLab is so transparent about everything, from their marketing/sales/feature proposals/technical issues/etc., they make it glaringly obvious, from time to time, that they lack very fundamental core skills, to do things right/well. In my opinion, they really need to focus on recruiting top talent, with domain expertise.
They (GitLab) need to convince those that would work for Microsoft or GitHub, to work for GitLab. With their current hiring strategy, they are getting capable employees, but they are not getting employees that can help solidify their place online (gitlab.com) and in Enterprise. The fact that they were so nonchalant about running bare metal and talking about implementing features, that they have no basic understanding of, clearly shows the need for better technical guidance.
They really should focus on creating jobs that pays $200,000+ a year, regardless of living location, to attract the best talent from around the world. Getting 3-6 top talent, that can help steer the company in the right direction, can make all the difference in the long run.
GitLab right now, is building a great company to help address low hanging fruit problems, but not a team that can truly compete with GitHub, Atlassian, and Microsoft in the long run. Once the low hanging fruit problems have been addressed, people are going to expect more from Git hosting and this is where Atlassian, GitHub, Microsoft and others that have top talent/domain expertise, will have the advantage.
Let this setback be a vicious reminder that you truly get what you pay for and that it's not too late to build a better team for the future.
> They really should focus on creating jobs that pays $200,000+ a year, regardless of living location
For those who haven't been following along, Gitlab's compensation policy is pretty much intentionally designed to not pay people to live in SF. It's a somewhat reasonable strategy for an all remote company. But they seem to have some pretty ambitious plans that may not be compatible with operating a physical plant.
I would point you to some very ambitious feature proposals on their issue tracker, but I can't for obvious reasons. I think GitLab is at a cross roads and this setback might be the eye opener they need. Moving forward, they really need to re-evaluate how they develop and evolve GitLab. For both online and Enterprise.
This idea of releasing early and on the 22nd works very well for low hanging fruits problems, but not for the more ambitious plans they have. If they understood the complexity for some of the more ambitious plans, they would know they are looking at, at least a year of R&D to create an MVP.
I think it makes sense to keep doing the release on the 22nd, but they also need to start building out teams that can focus on solving more complex problems that can take months or possibly a year to see fruition. Git hosting has reached a point, where differentiating factors can be easily copied and duplicated, so you are going to need something more substantive, to set yourself apart from the rest. And this is where I think Microsoft may have the upper hand in the future.
> I think GitLab is at a cross roads and this setback might be the eye opener they need. Moving forward, they really need to re-evaluate how they develop and evolve GitLab.
Judging by their about team[1] page, they are currently short an Infrastructure Director. When you read their job listings, even for DBAs and SREs, it's all "scale up and improve performance." Very little "improve uptime, fight outages." One assumes it's upper management approving the job descriptions, so the missing emphasis on uptime and redundancy probably pervades the culture. And again, judging by the team profile, they've hired very few DBA / SRE experts, and instead appear to have assigned Ruby developers to the tasks.
Perhaps they simply have to bet the farm on scaling much larger to sustain the entire firm, which is troubling for enterprise customers, and for teams like mine running a private instance of the open source product. Should probably review the changelog podcast interview[2] with the CEO and see if any quotes have new meaning after today.
> Gitlab's compensation policy is pretty much intentionally designed to not pay people to live in SF.
What do you mean? They pay people in SF much more than in other cities because of the high cost of living. I'd consider working for Gitlab if I lived in SF; living in Berlin, it's not an option.
Look, I love GitLab. Gitlab was there for me when both my son and I got cancer, and they were more than fair with me when I needed to get healthy and planned to return to work. I have nothing but high praises for Sid and the Ops team.
With that said, I'll agree that the salary realities for GitLab employees are far below the base salary that was expected for a senior level DevOps person. I've got about 10 years experience in the space, and the salary was around $60K less than what I had been making at my previous job. I took the Job at GitLab because I believe in the product, believe in the team, and believe in what Gitlab could become...
With that said, starting from Day 1, we were limited by an Azure infrastructure that didn't want to give us Disk iops, legacy code and build processes that made automation difficult at times, and a culture that proclaimed openness, but, didn't really seem to be that open. Some of the moves that they've made (Openshift, rolling their own infrastructure, etc) have been moves in the right direction, but, they still haven't solved the underlying stability issues -- and these are issues that are a marathon, not a sprint. They've been saying that the performance, stability, and reliability of gitlab.com is a priority -- and it has been since 2014 -- but, adding complexity to the application isn't helping: if I were engineering management, I'd take two or three releases and just focus on .com. Rewrite code. Focus on pages that take longer than 4 seconds to return and rewrite them. When you've got all of that, work on getting that down to three seconds. Make gitlab so that you can run it on a small DO droplet for a team of one or two people. Include LE support out of the box. Work on getting rid of the omnibus insanity. Caching should be a first class citizen in the Gitlab ecosystem.
I still believe in Gitlab. I still believe in the Leadership team. Hell, if Sid came to me today and said, "Hey, we really need your technical expertise here, could you help us out," I'd do so in a heartbeat -- because I want to see GitLab succeed (because we need to have quality open source alternatives to Jira, SourceForge Enterprise Edition, and others).
Not trying to be combative, but, "You truly get what you pay for" seems a little vindictive here -- the one thing that I wish they would have done was be open with the salary from the beginning -- but, Sid made it very clear that the offer that he would give me was going to be "considerably less" than what I was making.
> They really should focus on creating jobs that pays $200,000+ a year, regardless of living location, to attract the best talent from around the world. Getting 3-6 top talent, that can help steer the company in the right direction, can make all the difference in the long run.
SIGN ME UP! That would be a freaking great opportunity!!
Yup - top talent is already making more. Gitlab needs to recruit with purpose (this is what we're doing and why), environment (remote first, transparency, etc), and pay (we can match 70% of what you'd get at XYZ Company). Right now, it feels like they're capped at 30-50% of what someone could make at a big org, which is just a drop in salary most people would never take, regardless of the company values/purpose.
One alternate idea would be to hire consultants on a temporary basis. You may not be able to pay $250k a year, but you could pay a one time $40k fee to review the architecture and come up with prioritized strategy for disaster recovery and scalability.
Why would they try to recruit from Microsoft? Most of the software engineers at Microsoft are not focused on developing scalable web services architectures. And the ones that do have built up all of their expertise with Microsoft technologies (.net running on Windows server talking to mssql).
>Microsoft and others that have top talent/domain expertise, will have the advantage.
Again, Microsoft isn't even in this same field (git hosting) or if they are, are effectively irrelevant due to little market/mindshare. Are you an employee there or something?
> Most of the software engineers at Microsoft are not focused on developing scalable web services architectures.
Uh, MS literally runs Azure, which may not be the biggest IAAS offering, but is certainly vastly larger and more complex than Gitlab. There are certainly numerous engineers at MS who would have experience relevant to Gitlab (though perhaps not with their particular tech stack). It may not be most of the engineers there, but in a company with literally tens of thousands of engineers, there are few things that will be true of most of them.
> Microsoft isn't even in this same field (git hosting)
How is what they're hosting at all relevant to the problem at hand? This could have happened regardless of what the end product was - it's a database issue. In fact, the git infrastructure was explicitly not involved in this issue - it was only their DB-backed features that had data loss.
Additionally, Microsoft is in the business of git hosting, if only tangentially. TFS supports git, and has since 2013: https://blogs.msdn.microsoft.com/mvpawardprogram/2013/11/13/... Your objection is both unkind and factually incorrect. The "mindshare" comment is a bit silly - even though they may not be as active on forums like HN, developers working on MS technologies are still one of the largest groups in programming (as a non-MS developer looking for work in the Pacific Northwest, this is something I'm constantly reminded of). I doubt your estimate of Microsoft's real mindshare is anything close to accurate.
> Are you an employee there or something?
This accusation is eminently not in the spirit of HN, and Microsoft was hardly the only company he mentioned. Whatever your personal vendetta against them, it's absurd to think that Microsoft is not one of the top pools of talent in tech - they're a huge company with a vast variety of offerings and divisions.
I'm not sure if you read my post correctly, but I never mentioned poaching from Microsoft. I said compete for programmers that would choose to work for Microsoft. I'm also not sure if you understand what Microsoft does. It's a very diverse company with R&D spending that rivals some small nations.
> Microsoft isn't even in this same field (git hosting)
Microsoft understands Enterprise and it's quite obvious they want to be a major provider for Git hosting. It will be foolish to believe Microsoft is not focused on owning the Git mindshare in Enterprise.
> Are you an employee there or something
No. Just somebody that understands this problem space.
One of the main drivers of revenue for Microsoft is Office 365, with 23.1 million subscribers[0]. Along with Azure, MS runs some of the largest web services around. Most developers at MS don't necessarily work on these products, but to say that all the devs working on them use a simple .NET stack + SQL Server is discrediting a lot of work that they do.
Disclaimer: I work for Microsoft in the Office division and opinions are my own
Hey there, honest question incoming. Any chances of you chaps making Word a better documentation tool in the future? Edit history storing formatting and data changes on the same tree is making it impossible to use Word for anything serious. This really comes to light once you start working at an MS tech company on documentation, where it is obvious that you should use MS products for work. Some tech writers I know just end up using separate technology branches for their group efforts, since neither Sharepoint nor Word is a professional tool for this job.
>us devopish types are both quite costly and quite rare - but I think operating the service as they do today seriously tarnishes the Gitlab brand.
The sad thing is it doesn't have to be this way. Software stacks and sysadmin is out there for the learning, but due to the incentives of moving jobs every two years, nobody wants to invest to make those people, we all know we'll find /someone/ to do it anyways.
I think they are running to catch up on the gitlab system itself, let alone running it as a production service. The bugs in the last few months have been epic. Backups not working, merge requests broken, chrome users seeing bugs, chaotic support. Basically their qa and release processes are not remotely enterprise ready.
If I understand correctly, the public Gitlab is similar to what you can get with a private Gitlab instance. That makes me wonder, instead of trying to scale the one platform up, would it be OK to spin up a second public silo? I mean yeah, it would be a different silo, but for something free I'd say "meh".
I think it's totally fine admitting when you've stopped being able to scale up, and need to start scaling out.
They could, and as a stopgap measure that might work, but..
(1) Some of the collaboration features (e.g. work on Gitlab itself) depend on having everyone on the same instance.
(2) Gitlab.com gives them a nice dogfood-esque environment for what it's like to actually operate Gitlab at scale. If they are having problems scaling it, then potentially so are their customers. Fixing the root cause is usually a good thing and is often an imperative to avoid being drowned in technical debt.
(3) It moves the problem in some respects. Modern devops techniques mostly allow the number of like servers to be largely irrelevant, but still.. the more unique instances of Gitlab, the more overhead there will be managing those instances (and figuring out which projects/people go on which instances).
It's a simple approach which I'm sure would work, but it also means a bunch of new problems are introduced which don't currently exist.
>1. LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage
>2. Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.
>3. SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.
>4. Disk snapshots in Azure are enabled for the NFS server, but not for the DB servers.
>The synchronisation process removes webhooks once it has synchronised data to staging. Unless we can pull these from a regular backup from the past 24 hours they will be lost
>The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented
>5. Our backups to S3 apparently don’t work either: the bucket is empty
>So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.
Sounds like it was only a matter of time before something like this happened. How could so many systems be not working and no one notice?
What if I told you all of society is held together by duct tape? If you're surprised that startups cut corners you're in for a rude awakening. I'm frequently amazed anything works at all.
True. I watched a 60 year old manufacturing plant shut down for 7 days once because someone saved an excel spreadsheet in the wrong format and it deleted a macro written more than a decade prior that held together all of their operations planning.
> Websites that are glorified shopping carts with maybe three dynamic pages are maintained by teams of people around the clock, because the truth is everything is breaking all the time, everywhere, for everyone. Right now someone who works for Facebook is getting tens of thousands of error messages and frantically trying to find the problem before the whole charade collapses. There's a team at a Google office that hasn't slept in three days. Somewhere there's a database programmer surrounded by empty Mountain Dew bottles whose husband thinks she's dead. And if these people stop, the world burns.
And if you think that paragraph does not apply to your applications / company, ask yourself if that's really true. My company sends around incident statistics and there's always some shit that broke. Always.
I don't think a lot of things work at all. We who live in developed countries are the minority really. Most things are simply non-functional and downright ugly. Such complex systems the world has. And now apparently even the developed worlds are in for some troubles. The peace and progress in the last few decades just really isn't the norm in human history.
That said, one can't deny that there are indeed things that do work, and work very well, and people who make that happen, and one can always be amazed/inspired by those. There are good things as well as haphazard things. It's just that the latter generally outnumber the former in many settings. It doesn't necessarily imply a sweeping statement about everything though.
It's called the Fundamental Failure-Mode Theorem - "Complex systems usually operate in an error mode". https://en.wikipedia.org/wiki/Systemantics has more rules from the book. It's worth the read.
if #2 is correct, holy shit did gitlab get lucky someone snapshotted 6 hours before.
Dear you: it's not a backup until you've (1) backed up, (2) pushed to external media / s3; (3) redownloaded and verified the checksum; (4) restored back into a throwaway; (5) verified whatever is supposed to be there is, in fact, there, and (6) alerted if anything went wrong. Lots of people say this, and it's because the people saying this, me included, learned the hard way. You can shortcut the really painful learning process by scripting the above.
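For what it's worth, here is a rough sketch of scripting those six steps in bash. Everything in it is an assumption for illustration: `verified_backup` and `alert` are made-up names, the "remote" is just a local directory standing in for external media / S3, and it backs up a plain directory with tar rather than a database. It assumes GNU `sha256sum` is available.

```shell
# Sketch only. SRC is the directory to protect; REMOTE is a local directory
# standing in for external media / S3 (replace the cp calls with your real
# upload and download tool). Needs bash, tar, diff and GNU sha256sum.

alert() { echo "ALERT: $*" >&2; }   # (6) stand-in for paging or email

verified_backup() {
    local src="$1" remote="$2" work archive sum_up sum_down
    work="$(mktemp -d)" || return 1
    archive="backup_$(date +%Y%m%d_%H%M%S).tar.gz"

    # (1) back up
    tar -czf "$work/$archive" -C "$src" . || { alert "backup failed"; return 1; }
    sum_up="$(sha256sum "$work/$archive" | cut -d' ' -f1)"

    # (2) push to external storage
    cp "$work/$archive" "$remote/" || { alert "upload failed"; return 1; }

    # (3) download it again and verify the checksum
    mkdir "$work/download" "$work/restore"
    cp "$remote/$archive" "$work/download/" || { alert "download failed"; return 1; }
    sum_down="$(sha256sum "$work/download/$archive" | cut -d' ' -f1)"
    [ "$sum_up" = "$sum_down" ] || { alert "checksum mismatch"; return 1; }

    # (4) restore into a throwaway location
    tar -xzf "$work/download/$archive" -C "$work/restore" \
        || { alert "restore failed"; return 1; }

    # (5) verify that what is supposed to be there is, in fact, there
    diff -r "$src" "$work/restore" >/dev/null \
        || { alert "restored data differs from source"; return 1; }

    rm -rf -- "$work"   # scratch space from mktemp, not the backup itself
    echo "OK: $remote/$archive"
}
```

Run it from cron as something like `verified_backup /srv/data /mnt/offsite` (both paths hypothetical). On failure the scratch directory is left behind on purpose so you can look at what went wrong.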
Do you have to download the entire backup or is a test backup using the same flow acceptable? I'm thinking about my personal backups, and I don't know if I have the time or space to try the full thing.
For DB backups, until you've actually loaded it back into the DB, recovered the tables, and tested a couple rows are bit identical to the source, it's a hope of a backup not a backup. Things like weird character set encodings can cause issues here.
If time and space for the full thing are an issue, it could be really important to get going after an incident to be able to recover the most important bits first.
I'm guessing they all worked at some point in time, but they failed to set up any sort of monitoring to verify the state of their infrastructure over time.
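A minimal sketch of the kind of check that would have caught "files only a few bytes in size" and "the bucket is empty", assuming the newest backup lands at a known local path. `check_backup` and its thresholds are hypothetical; wire the non-zero exit status into whatever alerting you already have.

```shell
# check_backup FILE MIN_BYTES MAX_AGE_HOURS
# Returns non-zero (and prints an ALERT) if the backup file is missing,
# suspiciously small, or older than expected.
check_backup() {
    local file="$1" min_bytes="$2" max_age_hours="$3" size

    if [ ! -f "$file" ]; then
        echo "ALERT: $file is missing" >&2
        return 1
    fi

    size=$(( $(wc -c < "$file") ))
    if [ "$size" -lt "$min_bytes" ]; then
        echo "ALERT: $file is only $size bytes" >&2
        return 1
    fi

    # find prints the path only if it was modified more than N minutes ago
    if [ -n "$(find "$file" -mmin +"$(( max_age_hours * 60 ))")" ]; then
        echo "ALERT: $file is older than $max_age_hours hours" >&2
        return 1
    fi

    echo "OK: $file ($size bytes)"
}
```

For example, `check_backup /var/backups/db.sql.gz 1000000 26` (path and numbers made up) run hourly. This only proves the file exists, is recent and is not tiny; it says nothing about whether it restores.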
If you're a sys admin long enough, it will eventually happen to you that you'll execute a destructive command on the wrong machine. I'm fortunate that it happened to me very early in my career, and I made two changes in how I work at the suggestion of a wiser SA.
1) Before executing a destructive command, pause. Take your hands off the keyboard and perform a mental check that you're executing the right command on the right machine. I was explicitly told to literally sit on my hands while doing this check, and for a long time I did so. Now I just remove my hands from the keyboard and lower them to my side while re-considering my action.
2) Make your production shells visually distinct. I set up staging machine shells with a yellow prompt and production shells with a red prompt, with full hostname in the prompt. You can also color your terminal window background. Or use a routine such as: production terminal windows are always on the right of the screen. Close/hide all windows that aren't relevant to the production task at hand. It should always be obvious what machine you're executing a command on and especially whether it is production. (edit: I see this is in the outage remediation steps.)
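A rough sketch of point 2 for bash, to drop into a shared `~/.bashrc`. The hostname patterns (`*.prod.example.com`, `*.staging.example.com`) and the function name `prompt_color` are placeholders, so adjust them to however your machines are actually named.

```shell
# Pick an ANSI color code from the hostname (patterns are hypothetical).
prompt_color() {
    case "$1" in
        *.prod.example.com)    printf '1;31' ;;  # bold red: production
        *.staging.example.com) printf '1;33' ;;  # bold yellow: staging
        *)                     printf '0'    ;;  # default: everything else
    esac
}

# \H is the full hostname, \u the user, \w the working directory.
# \[ and \] tell bash the color codes take up no space on the line.
PS1="\[\e[$(prompt_color "$HOSTNAME")m\]\u@\H:\w\\$ \[\e[0m\]"
```

The color is decided once at login from `$HOSTNAME`, so the same file can be shipped to every machine.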
One last thing: I try never to run 'rm -rf /some/dir' straight out. I'll almost always rename the directory and create a new directory. I don't remove the old directory till I confirm everything is working as expected. Really, 'rm -rf' should trigger red-alerts in your brain, especially if a glob is involved, no matter if you're running it in production or anywhere else. DANGER WILL ROBINSON plays in my brain every time.
Lastly, I'm sorry for your loss. I've been there, it sucks.
YP thinks that perhaps pg_basebackup is being super pedantic about there being an empty data directory, decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com
Good lesson on the risks of working on a live production system late at night when you're tired and/or frustrated.
Also, as a safety net, sometimes you don't need to run `rm -rf` (a command which should always be prefaced with 5 minutes of contemplation on a production system). In this case, `rmdir` would have been much safer, as it errors on non-empty directories.
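To make the `rmdir` point concrete, a small sketch (the wrapper name `remove_if_empty` is made up): it only deletes a directory that really is empty and refuses otherwise.

```shell
# remove_if_empty DIR: delete DIR only when it is actually empty.
# rmdir itself does the checking; it fails on a non-empty directory.
remove_if_empty() {
    if rmdir -- "$1" 2>/dev/null; then
        echo "removed empty directory: $1"
    else
        echo "refusing: $1 is missing or not empty" >&2
        return 1
    fi
}
```

Had the data directory been removed this way, the command would have errored out on the populated one instead of deleting it.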
Or even instead of any kind of rm command. mv is less subtle. I tend to prefer
`mv x $(date +%Y%m%d_%s_)x`
where:
%Y - 4 digit year
%m - 2 digit month
%d - 2 digit day
_ - underscore literal
%s - linux timestamp (seconds since epoch)
This ensures that the versions you're 'removing' will be lexically sorted from newest to oldest in a way that is easy to interpret and also works if you need to try more than once in a day.
In case it's not apparent to some, this command moves the directory (or file) called
'x'
to something like
'20170131_1485916040_x'
Then when you're all done (i.e. production is humming and passing tests, no need to ever rush), you can delete the timestamped version, or if space is plentiful, move the old file to an archive directory as an extra redundancy (i.e. as an extra backup, not in lieu of a more thorough backup policy).
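If you want this as a reusable helper, here is a sketch. `set_aside` is a made-up name; unlike the one-liner it also copes with paths that contain a slash, by putting the timestamp in front of the last path component only. It relies on `date +%s`, same as above.

```shell
# set_aside PATH: rename PATH to <dir>/<YYYYmmdd>_<epoch>_<name> instead of
# deleting it, and print the new location.
set_aside() {
    local target="${1%/}" dir base dest
    if [ ! -e "$target" ]; then
        echo "no such path: $target" >&2
        return 1
    fi
    dir="$(dirname -- "$target")"
    base="$(basename -- "$target")"
    dest="$dir/$(date +%Y%m%d_%s)_$base"
    if [ -e "$dest" ]; then
        echo "already exists: $dest" >&2
        return 1
    fi
    mv -- "$target" "$dest" && printf '%s\n' "$dest"
}
```

Usage would be along the lines of `old="$(set_aside /some/dir)"`, followed later (once production is humming) by deleting or archiving `$old`.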
This needs to be the first thing anyone who works with stateful systems learns. NEVER rm. mv is insufficient. mv dir dir.bak.`date +%s` has prevented data loss for me several times.
I agree with what you're saying, and this is almost exactly what I do, but when disk space is limited - particularly during time-sensitive situations - this advice isn't very useful. For example, if a host or service is on the cusp of crashing because of a partition quickly filling up with rolling logs, what do you do since mv doesn't actually solve the problem?
At some level you have to run an rm, and you better hope you do it right in the middle of an emergency with people breathing over your shoulder.
In an ideal world, this wouldn't ever happen, but it does. Inherited/legacy systems suck.
You shouldn't let yourself get to that point. Your alerting system should alert you when disk is at 70% or something with a ton of margin. If it's not set up that way, go stop what you're doing and fix that. (Seriously.) If you're running your systems so they usually run at 90% disk usage, go give them more disk (or rotate logs sooner).
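As a sketch of the "alert at 70%" idea, assuming nothing fancier than cron and `df` is available. The function names are made up, and the 70 default is just the figure from above; a real setup would feed this into a proper monitoring system.

```shell
# disk_usage_pct PATH: print the used percentage (integer) of the
# filesystem holding PATH, parsed from POSIX-format df output.
disk_usage_pct() {
    df -P -- "$1" 2>/dev/null | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# check_disk PATH [THRESHOLD]: non-zero exit if usage is at or above
# THRESHOLD percent (default 70).
check_disk() {
    local path="$1" threshold="${2:-70}" pct
    pct="$(disk_usage_pct "$path")"
    case "$pct" in
        ''|*[!0-9]*)
            echo "ALERT: cannot read disk usage for $path" >&2
            return 2 ;;
    esac
    if [ "$pct" -ge "$threshold" ]; then
        echo "ALERT: $path is at ${pct}% (threshold ${threshold}%)" >&2
        return 1
    fi
    echo "OK: $path is at ${pct}%"
}
```

Something like `check_disk /var/log 70 || notify-someone` every few minutes, where `notify-someone` is whatever your paging hook happens to be.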
And even assuming all that fails and I'm in that situation where I have seconds until the disk hits 100%, I would much rather the service crashes than make a mistake and delete something critical.
If someone is breathing over your shoulder, you can even enlist them to co-pilot what you're doing. Even if they're not technical enough to understand, talking at them what you're about to do will help you spot mistakes.
If you lose a shoe while running across a highway, it's probably not worth the risk trying to get it back.
The disk usage is just one example of many. The actions that you take really depend on what field you're in, and again, legacy/inherited systems are completely filled with this sort of shit. You can say that I shouldn't let it get to that point, but you're kinda dismissing the point I'm trying to make that shit can and will happen, and you need to know how to deal with shit on your toes. There are times when you have to do things that would make most people flip their shit. There are ways to mitigate the risk in emergency scenarios as you say, but when the risk is actually worth it, you tend to do Bad Things because there's no other option.
In my case, it was in HFT where I inherited the infrastructure from a JS developer who inherited it from a devops engineer who inherited it from a linux engineer who inherited it from another linux engineer. It was a complete shitshow that I was dropped into mostly on my own with little warning. To make matters worse, each maintenance period was 45 minutes at 4:15pm and weekends. Even worse, if a server went down at 5:00pm, the company immediately lost about 35k - which was the same for if the trading software went down. When I asked for additional hardware to do testing on, I was told that there wasn't a budget for it. The saving grace there was that there was 23:15h of time to plan during downtime, so an `rm -rf /` would have had nearly identical long term impact as a `kill -9` on the application server.
Mind you, the owners of the company were some of the smartest and most technical folk I've ever worked with and were surprisingly trusting in my ability to manage the infrastructure. The company no longer exists, and not without reason.
Just to show the lunacy of the infrastructure - they had their DNS servers hosted on VMs that required DNS to start. About a month after I joined, we had a power failure. You can imagine how that went..
(That all said, it was the greatest learning experience I've ever had. Burned me out a tad, though.)
I don't mean to sound flippant, but if I walked into the situation you describe, I would immediately walk right back out. That is a situation set up from the start for failure, and no way would I care to be responsible for it.
I certainly get that we inherit less-than-ideal systems from time to time; I've been there. But I've also learned that every time I get paged in the middle of the night, it's my failure, whether for a lack of an early-warning system, or for doing a bad up-front job of building self-healing into my systems. If I inherit a system that I can tell is going to wake me up at night, I refuse to be responsible for it in an on-call capacity until I've mitigated those problems.
There seems to be this weird thing in the dev/ops world where it's somehow courageous to be woken up at 3am to heroically fix the system and save the company. I've been that guy, and I'm sick of it. It's not heroic: it's a sign of a lack of professionalism leading up to that point. Make your systems more reliable, and make them continue to chug along in the face of failure, without human interaction. If you have management that doesn't support that approach, make them support it, or walk out. Developers and operations people are in high enough demand right now in most markets that there will be another company that would love to have you, hopefully with more respect for your off-duty time.
I once tried to help a company with similar infrastructure insanity recover from a massive failure. Absolutely brutal.
When my team finally got services up and running (barely) after ~18 hours of non-stop work, the CTO demanded that we not go home and get some sleep until everything was exactly as it had been before the failure.
Unfortunately the answer is "it depends on the application". I tend to run stuff with even higher margins: I never expect more than 30-40% disk utilization. Yes, it's more expensive, but I value my (and my colleagues') sleep more.
But it's all just about measurement. Run your application with a production workload and see how large the logs are that it generates within defined time intervals. Either add disk or reduce logging volume until you're happy with your margins. (Logging is often overlooked as something you need to design, just like you design the rest of your application.)
Log rotation should be a combination of size- and time-based. You probably want to only keep X days of logs in general, but also put a cap on size. If you're on the JVM, logback, for example, lets you do this: if you tell it "keep the last 14 log files and cap each log file at 250MB", then you know what the max disk usage for logging will be.
If you can do it, use an asynchronous logging library that can fail without causing the application to fail. If your app is all CPU and network I/O, there's no reason why it needs disk space to function properly. If you can afford it, use some form of log aggregation that ships logs off-host. Yes, you've in some ways just moved the problem elsewhere, but it's easier to solve that once in one service (your log aggregation system) than in every individual service.
If your app does require disk space to function properly, then of course it's a bit harder, and protecting against disk-full failures will require you to have intimate knowledge of what it needs disk for, and what the write patterns are.
It's never going to be perfect. Just as 100% uptime is, over a longer time scale, unachievable, you're never going to eliminate every single thing that can get you paged in the middle of the night. But if you can reduce it to that one-in-a-million event, your time on-call can really be peaceful. And when you do get that page, look really hard at why you got paged, and see what you can do to ensure that particular thing doesn't require human intervention to fix in the future. You may decide the cost of doing so isn't worth the time, and that getting woken up once every X days/weeks/months/whatever is fine. But make that your choice; don't leave it up to chance.
I'm curious, too. Proper disk space management and monitoring is probably the most difficult problem I know of in the ops field. I haven't seen anybody do it in a way that prevents 3am wakeup calls or a 24/7 ops team.
For example, a 3am network blip that causes the application server (still logging in DEBUG from the last outage) to fill up its log partition while it can't communicate to some service nobody monitors anymore. Not sure how you'd solve that one.
> For example, a 3am network blip that causes the application server (still logging in DEBUG from the last outage)
Nope. Don't do that. Infra should be immutable. If you need to bring up a debug instance to gather data, that's fine, but shut it down when you're done. If you don't, and it causes an issue, you know who to blame for that.
> to fill up its log partition while it can't communicate to some service nobody monitors anymore.
Sane log rotation policies (both time- and size-based) solve this. If you tell your logging system "keep 14 old log files and never let any single log file grow above 250MB", then you know the upper bound on the space your application will ever use to log.
Also, why are you not monitoring logs on this service? If it's spewing "ERROR Can't talk to service foo" into its log file, why aren't you being alerted on that well before the disk fills up?
> ... nobody monitors anymore.
Nope. Not allowed. Fix that problem too. Unmonitored services aren't allowed in the production environment, ever.
I've heard (and given) all the excuses for this, but no, stop that. You're a professional. Do things professionally. When management tells you to skimp on monitoring and failure handling in order to meet a ship date, you push back. If they override you, you refuse on-call duty for that service. (Or you just ignore their override and do your job properly anyway.) If they threaten to fire you, you quit and find a company that has respect for your off-duty time. Good devs & ops people are in high enough demand these days that you shouldn't be unemployed for long.
We switched over to centralized logging two years ago. All hosts are configured to keep only small logfiles and roll them over every few megabytes. Filling up the centralized logging is nearly impossible when good monitoring is in place and used disk space never goes over 50%.
Btw, using an orchestration platform takes care of many of those "one node is acting up, so I have to step in by hand and accidentally do something stupid" situations.
Monitor the rate at which the disk is filling up, and extrapolate that to when it will hit 90%. If that time is outside business hours, alert early. If current time is not in business hours, alert later if possible.
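A minimal sketch of the extrapolation part, assuming you already sample disk usage at a fixed interval. `seconds_until_limit` is a hypothetical helper, the units just have to be consistent (percent, MB, whatever), and it uses a naive straight line through two samples; the business-hours logic is left out.

```shell
# Estimate seconds until usage reaches <limit>, from two samples taken
# <interval_s> seconds apart. Linear extrapolation, integer arithmetic.
seconds_until_limit() {
    used_then=$1
    used_now=$2
    interval_s=$3
    limit=$4
    growth=$(( used_now - used_then ))
    if [ "$used_now" -ge "$limit" ]; then
        echo 0          # already at or past the limit
        return 0
    fi
    if [ "$growth" -le 0 ]; then
        echo never      # flat or shrinking, nothing to extrapolate
        return 0
    fi
    echo $(( (limit - used_now) * interval_s / growth ))
}

# 50% ten minutes ago, 60% now, alert threshold at 90%
seconds_until_limit 50 60 600 90   # prints 1800
```

Something going rogue shows up as a sudden drop in that number, which is a more useful alert than a static "disk at 90%" threshold.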
How does this help in situations where something rogue starts filling the disk? The idea makes sense in theory, but in practice, it doesn't work out that well. Ops work is significantly harder than many devs think.
> Ops work is significantly harder than many devs think
No, it's not (I've done both). Ops is about process, and risk analysis and mitigation. Yes, there's always the possibility that something can go rogue and start filling your disk. That shouldn't be remotely common, though, if you've built your systems properly.
These days, I've been very explicit in how I run rm. To the extent that I don't do rm -rf or rmdir (edit: immediately), but in separate lines as something like:
pushd dir ; find . -type f -ls | less ; find . -type f -exec rm '{}' \; ; popd ; rm -rf dir
It takes a lot longer to do, but I've seen and made enough mistakes over the years that the forced extra time spent feels necessary. It's worked pretty well so far -- knock knock.
This seems to be the best here. As a side note: if someone does something more complicated and uses piping find output to xargs, there are very important arguments to find and xargs to delimit names with binary zero -- -print0 and -0 respectively.
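For anyone who hasn't used them, a small sketch of the -print0 / -0 pairing, run against a throwaway directory. The file names are invented for the demo; the point is that names with spaces or even newlines survive the pipe intact.

```shell
# Demo in a scratch directory so nothing real is touched.
workdir=$(mktemp -d)
touch "$workdir/plain.log" "$workdir/name with spaces.log" "$workdir/keep.txt"
touch "$workdir/$(printf 'line\nbreak').log"   # a newline inside the file name

# Review first: names are passed NUL-delimited, so odd names stay whole.
find "$workdir" -type f -name '*.log' -print0 | xargs -0 ls -l

# Then delete exactly the same selection.
find "$workdir" -type f -name '*.log' -print0 | xargs -0 rm --

remaining=$(find "$workdir" -type f | wc -l | tr -d ' ')
echo "files left: $remaining"   # only keep.txt survives
```

Without the NUL delimiters, xargs would split "name with spaces.log" into three arguments and rm would go looking for files that were never in the list.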
I admit the code can look a little weird, but it was because I had some rather tight constraints: 1 file, all filenames `\0` separated internally and just POSIX `sh`. I still wanted to reuse code and properly quote variables inside `xargs` invocations (because `sh` does not support `\0`-separated reads), so I ended up having to basically paste function definitions into strings and use some fairly expansive quotation sequences.
\0 is an insanely useful separator for this sort of thing and yeah, it definitely gets messy. I'm working on a similar project that uses clojure/chef to read proc files in a way that causes as little overhead as possible. \0 makes life so much easier. The best example I can think of off the top of my head is something similar to:
I was so freaked out at the news, I normally have local backups of my projects but I just happened to be in the middle of a migration where my code was just on Gitlab, and then they went down... Luckily it all turned out OK.
\0 is very useful but I really wish for an updated POSIX sh standard with first-class \0 support.
On your code, why do you replace \0's with newlines? egrep has the -z flag which makes it accept \0-separated input. A potential downside to it is that it automatically also enables the -Z flag (output with \0 separator).
I solved the "caller might use messy newline-separated data"-problem by having an off-by-default flag that makes all input and output \0-separated; this is handled with a function called 'arguments_or_stdin' (which does conversion to the internal \0-separated streams) and 'output_list' (which outputs a list either \0- or \n-separated depending on the flag).
I would add a step where you dump the output of find (after filtering) into a textfile, so you have a record of exactly what you deleted. Especially when deleting files recursively based on a regular expression that extra step is very worthwhile.
It's also a good practice to rename instead of delete whenever possible. Rename first, and the next day when you're fresh walk through the list of files you've renamed and only then nuke them for good.
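Roughly what that looks like in practice, as a sketch against a scratch directory. The manifest and quarantine paths are placeholders I picked for the demo, and the `*.tmp` filter stands in for whatever expression you are actually deleting by.

```shell
# Scratch directory standing in for the real target.
target=$(mktemp -d)
touch "$target/a.tmp" "$target/b.tmp" "$target/keep.conf"

manifest="${target}.removed-list.txt"   # record of what was selected
quarantine="${target}.quarantine"       # sibling dir, not inside $target
mkdir -p "$quarantine"

# 1. Write down exactly what the filter matches.
find "$target" -type f -name '*.tmp' -print > "$manifest"

# 2. Move the matches aside instead of deleting them.
find "$target" -type f -name '*.tmp' -exec mv -- {} "$quarantine"/ \;

# 3. The next day, after reading the manifest with a clear head:
#    rm -rf "$quarantine"
```

The move is cheap on the same filesystem, and undoing it is one mv instead of a restore from backup.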
I am actually curious what user they were logged in as and what permissions were in effect.
Unfortunately, the answer most places is that the diagnostic account (as opposed to the corrective action account) is fully privileged (or worse, root).
From a comment in the doc ("YP says it’s best for him not to run anything with sudo any more today, handing off the restoring to JN"), I assume they're running as a regular user, sudo'ing as necessary.
Having a snapshotting storage system (NetApp) once saved a lot of pain when I accidentally deleted the wrong virtual machine disk from an internal server (hit the system disk instead of a removed data disk). I was able to recover the root disk from a snapshot and bring up the machine in less than an hour.
Snapshots are not a backup strategy, but they make me sleep better at night regardless.
All of my non-production machines have emojis in PS1 somewhere. It sounds ridiculous, but I know that if I see a cheeseburger or a burrito I'm not about to completely mess everything up. Silly terminal = silly data that I can obliterate.
I think I'd rather make the production systems stand out, and add the emojis there. My prompts have a red background, but emoji prompts just tickle me, somehow.
ha! I do the same thing with figlet and cowsay, it prints a big dragon saying "welcome to shell!", if I see that, then I know I'm on a box I own/have sudo/am me. it's a good visual reminder. I don't fuss with prompts much, but this is a pretty good idea!
I do this too, but in this case both machines were production, so this alone would not have sufficed. The system-default prompts on the other hand are universally garbage.
I have a user and a local admin account at work on our Windows SOE. I have made my PC's cmd.exe show green on black in large font, but intentionally did not change the local admin. When I run a command prompt as admin it's visibly different. This is a great tip and it's caught me many times.
If using iterm you can set a "badge" text on a terminal window that shows up as an overlay. Super useful when you have lots of SSH sessions open to different servers.
'db1' vs. 'db2' is still insufficiently clear, though. Even better would be e.g. to name development systems after planets and production systems after superheroes. Very few people would mistake 'superman' for either 'green-lantern' or 'pluto,' but it's really easy to mistake 'sfnypudb13' for 'sfnydudb13.'
Visually there's not a whole lot of difference. Ideally, you want something where the shape of each name is reasonably distinct from all the rest — otherwise folks will just ignore that odd blot before the prompt.
And then, once you've started naming things as $ENV-$TYPE, someone will want to cram in the location, and the OS, and the team which maintains it, and the customer, and and and. Then someone will reduce all of those identifiers into single characters … and you'll be in the situation I mentioned. Clearly clpudb1 is CharlesCorp's first London production database system!
I strongly argue against those types of abbreviations in naming, for exactly the points you make.
Here's how I view it:
The distinction between prod and dev is pretty clear cut. The words "dev" and "prod" have significantly different shapes and are immediately unambiguous. There's no need to remember a superhero vs astronomical object distinction.
As the number of database server instances grows, you've got another level of naming issues that arises when needing to distinguish between hosts and db server instances—and perhaps even database/schema independence if necessary. Staying consistent and easy-to-use with an arbitrary naming scheme becomes increasingly unwieldy, in my opinion.
Additionally, I have taken up a habit of not abbreviating "production" and "staging", whenever possible. The fact that len("production") > len("staging") > len("dev") is a feature when you find yourself typing it into a terminal or db shell.
At work we have an ascii-art, most-of-80x40-filling version of our logo rendered in an environment-associated color on login + matching prompt. The ascii-art logo might not have helped in this case (if it was a long sequence of console interactions), but it does catch "db1" vs "db2" typos, for instance, and also elicits a certain reverence upon connection.
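A stripped-down version of the environment-colored login banner, as a sketch only. It assumes the environment is spelled out in the hostname ("production", "staging"), which is a naming convention from this thread and not something every shop has; `env_banner` is a made-up function name.

```shell
# Print a colored banner based on the environment named in the hostname.
# Assumption: hostnames contain the unabbreviated environment name.
env_banner() {
    case $1 in
        *production*) printf '\033[41;97m PRODUCTION \033[0m' ;;  # white on red
        *staging*)    printf '\033[43;30m staging \033[0m' ;;     # black on yellow
        *)            printf '\033[42;30m dev \033[0m' ;;         # black on green
    esac
}

# e.g. from a login script:
env_banner "$(hostname)"
echo
```

It won't tell db1 from db2 when both are production, but it does make it hard to miss which class of machine you just landed on.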
Why would it matter? In my last job we had user home directories synced via puppet (I am overly simplifying this) which enabled any ops guy to have same set of shell and vim configuration settings on production machines too.
I daresay - having hostname as part of prompt saves lot of trouble.
Having the hostname on the prompt is a good idea, but I don't think it would have helped with this process failure.
I work at a company where we have hundreds of database machines. Running this kind of command _anywhere_ without some kind of plan would be foolish. (It's one of the reasons why we have a datastore team that handles database administration.)
But the same lesson applies to application servers as well. Don't run deleterious commands out of curiosity. Have a peer-reviewed roll plan to act on when doing things like this. A roll plan would have called for verifying the host before running the command.
But even before that, the issue should have been investigated more!
All of these things contributed to the failure. There should ideally be better ownership through dedicated roles, peer-reviewed processes for dangerous activities, and a better process for investigation that does not involve deleting things haphazardly.
Also a good lesson for testing your availability and disaster recovery measures for effectiveness.
Far, far too many companies get production going and then just check to see that certain things "completed successfully" or didn't throw an overt alert in terms of their safety nets.
Just because things seem to be working doesn't mean they are or that they are working in a way that is recoverable.
I'm not sure there is a "late at night" or "tired" in the incident report. All times are UTC and it's unclear where the team is located, but if they are in SF then the incident started at 4pm. That doesn't change the fact that you shouldn't be firefighting in hero mode for extremely long stretches, but it's not exactly the same as being exhausted and powering away for hours.
"At this point frustration begins to kick in. Earlier this night YP explicitly mentioned he was going to sign off as it was getting late (23:00 or so local time), but didn’t due to the replication problems popping up all of a sudden."
I think I can count on one hand the number of times I've run an rm command on a production server. I'll move it at worst, and only delete anything if I'm critically low on disk space. But even then I don't even like typing those characters if I can avoid it, regardless of if I'm running as root or a normal user.
I remember when we accidentally deleted our customers' data. That was the worst feeling I ever had running our business. It was about 4% of our entire storage set and we had to let our customers know and start the resyncs. Those first 12 hours of panic were physically and emotionally debilitating - more than they have the right to be. I learned an important lesson that day: Business is business and personal is personal. I remember it like it was yesterday, the moment I consciously decided I would no longer allow business operations to determine my physical health (stress level, heart rate, sleep schedule).
For what it's worth, it was a lesson worth learning despite what seemed like catastrophic world-ending circumstances.
We survived, and GitLab will too. GitLab has been an extraordinary service since the beginning. Even if their repos were to get wiped (which seems not to be the case), I'd still continue supporting them (after I re-up'd from my local repos). I appreciate their transparency and hope that they can turn this situation into a positive lesson in the long run.
Best of luck to GitLab sysops and don't forget to get some sleep and relax.
Definitely get sleep, but it would be nice if the site were back online before that. I actually just created a new GitLab account and project a couple days ago for a project I needed to work on with a collaborator tonight. This is not a good first impression.
I applaud their forthrightness and hope that it's recoverable so that most of the disaster is averted.
To me the most illuminating lesson is that debugging 'weird' issues is enough of a minefield; doing it in production is fraught with even more peril. Perhaps we as users (or developers with our 'user' hat on) expect so much availability as to cause companies to prioritize it so high, but (casually, without really being on the hook for any business impact) I'd say availability is nice to have, while durability is mandatory. To me, an emergency outage would've been preferable to give the system time to catch up or recover, with the added bonus of also kicking off the offending user causing spurious load.
My other observation is that troubleshooting -- the entire workflow -- is inevitably pure garbage. We engineer systems to work well -- these days often with elaborate instrumentation to spin up containers of managed services and whatnot, but once they no longer work well we have to dip down to the lowest adminable levels, tune obscure flags, restart processes to see if it's any better, muck about with temp files, and use shell commands that were designed 40 years ago for when it was a different time. This is a terrible idea. I don't have an easy solution for the 'unknown unknowns', but the collective state of 'what to do if this application is fucking up' feels like it's in the stone ages compared to what we've accomplished on the side of when things are actually working.
Be careful not to overlook the benefits of instrumentation even in the "unknown unknowns" scenario. If you implement it properly, the instruments will alert you to where the problem is, saving you time from debugging in the wrong place.
The initial goal of instrumentation should be to provide sufficient cover to a broad area of failure scenarios (database, network, CPU, etc), so that in the event of a failure, you immediately know where to look. Then, once those broad areas are covered, move onto more fine-grained instrumentation, preferably prioritized by failure rates and previous experience. A bug should never be undetectable a second time.
As a contrived example, it was "instrumentation," albeit crudely targeted, that alerted GitLab the problem was with the database. This instrumentation only pointed them to the general area of the problem, but of course that's a necessary first step. Now that they've had this problem, they can improve their database-specific instrumentation and catch the error faster next time.
Engineering things to work well and the troubleshooting process at a low level are one and the same. It's just that in some cases other people found these bugs and issues before you and fixed them. But this is the cost of OSS, you get awesome stuff for free (as in beer) and are expected to be on the hook for helping with this process. If you don't like it, pay somebody.
Really everyone could benefit from learning more about the systems they rely on such as Linux and observability tools like vmstat, etc. The fewer lucky guesses or cargo-culted solutions you use the better.
Seems like very basic mistakes were made, not at the event but way long before. If you don't test to restore your backups, you don't have a backup. How does it go unnoticed that S3 backups don't work for so long?
Not "can restore them", it's "have restored them".
Best way to ensure that is to have backup restoration be a regularly scheduled event. For most apps I work on, that's either daily or (worst case) weekly, with prod being entirely rebuilt in a lower environment. Works great for creating a test lane too!
> How does it go unnoticed that S3 backups don't work for so long?
My uneducated guess (this one hit a friend of mine): expired/revoked AWS credentials combined with a backup script that doesn't exit(1) on failure and just writes the exception trace to stderr.
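One way to keep a backup script from failing quietly, sketched as a wrapper that checks both the exit status and that the output isn't empty. `backup_or_fail` is a hypothetical helper and the commands in the demo are stand-ins; in a real script the wrapped command would be your actual dump or upload.

```shell
# Run a backup command, writing its stdout to <dest>.
# Returns non-zero if the command fails or the result is empty.
backup_or_fail() {
    dest=$1
    shift
    if ! "$@" > "$dest"; then
        echo "backup command failed: $*" >&2
        return 1
    fi
    if [ ! -s "$dest" ]; then
        echo "backup file is empty: $dest" >&2
        return 1
    fi
}

# Demo with stand-in commands.
out=$(mktemp)
if backup_or_fail "$out" echo "pretend dump data"; then
    echo "backup ok"
fi
if ! backup_or_fail "$out" false; then
    echo "failure was noticed"
fi

# In a cron job you would end with something like:
#   backup_or_fail /path/to/dump.sql <your dump command> || exit 1
```

The non-zero exit is what lets cron mail, a wrapper, or your monitoring notice. A trace on stderr that nobody reads does not count.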
As I read the report I notice a lot of PostgreSQL "backup" systems depend on snapshotting from the FS & Rsync. This may work for database write logs, but it certainly will corrupt live git repositories that use local file system locking guarantees. NFS also requires special attention (a symlink lock) as writes can be acknowledged concurrently for byte offsets unless NFSv4 locking & compatible storage software is used.
Gitlab, I know you are all under pressure atm but when the storm passes feel free to reach out to my HN handle at jmiller5.com and I'd be happy to let you know if any of your repository backup solutions are dangerous/prone to corruption.
I see LVM[1] mentioned in the notes. It allows you to, among other things, snapshot a filesystem atomically which you could then mount read-only to a separate location to read for backups or export to a different environment. That would give you a point in time view of the state of all the repos that should be as consistent as a "stop the world then backup" approach.
LVM snapshots the raw block device (logical volume). The filesystem is layered on top of that, and then open and partially written files on top of that. So snapshotting an active database is really not the best idea; it might work, it should work, but it'll need to discard any dirty state from the WAL when you restart it with the snapshot. You might be in for more trouble with other data and applications, depending upon their requirement for consistency.
It's definitely not as consistent as "stop the world then backup" because the filesystem is dirty, and the database is dirty. It's equivalent to yanking the power cord from the back of the system, then running fsck, then replaying all the uncommitted transactions from the WAL.
It's for this reason that I use ZFS for snapshotting. It guarantees filesystem consistency and data consistency at a given point in time. It'll still need to deal with replaying the WAL, but you don't need to worry about the filesystem being unmountable (it does happen), and you don't need to worry about the snapshot becoming unreadable (once the snapshot LV runs out of space). LVM was neat in the early 2000s, but there are much better solutions today.
> LVM snapshots the raw block device (logical volume). The filesystem is layered on top of that, and then open and partially written files on top of that. So snapshotting an active database is really not the best idea; it might work, it should work, but it'll need to discard any dirty state from the WAL when you restart it with the snapshot. You might be in for more trouble with other data and applications, depending upon their requirement for consistency.
> It's definitely not as consistent as "stop the world then backup" because the filesystem is dirty, and the database is dirty. It's equivalent to yanking the power cord from the back of the system, then running fsck, then replaying all the uncommitted transactions from the WAL.
I was referring to using LVM to snapshot the filesystem where the git repos are hosted. It'd work for a database as well, assuming your database correctly uses fsync/fdatasync, and for git specifically it works fine.
Using LVM snapshots with a journaled filesystem (i.e. any modern/sane choice for a fs) should have no issues though there would be some journal replay at mount time to get things consistent (v.s. say ZFS which wouldn't require it). If it does have issues, you'd have the same issues with the raw device in the event of hard shutdown (ex: power failure).
Quick question for you since it seems like you are very knowledgeable.
I am the sole back end developer for a greenfield web application that is very data heavy. The application is still in alpha at the moment, but part of the development process involved prepopulating the database with about 10 mil rows or so spread out over about 15 tables. Nothing too crazy. However, once the application is launched I expect to have exponential growth in data due to the nature of the application.
Currently, this application is set up on Linode. The database server is standalone and I have the ability to spin up multiple application machines and a load balancer. Each of these application machines reads from and writes to the single database machine. The database machine itself has full disk image backups taken every 24 hours, 7 days, and month. I also do a manual snapshot from time to time. On top of this I also usually dump into a tar file on my own external drive every once in a while as well. I'm fairly new to devops stuff and most of my experience involves building applications and not necessarily deploying them. I'm wondering if what I'm doing as far as backups and stability is enough or if I should be incorporating other methods as well. Have any thoughts?
The link discusses why rsync and tarballs are not good backup solutions. But, those wouldn't rightly be called "snapshots". "Snapshot" implies atomic, right? Surely an atomic snapshot would not corrupt git -- I would expect git, like any database, is designed to be recoverable after power failure, to which recovering from an atomic snapshot should be equivalent.
Say a git server is in the middle of a write to refs/heads/master . You atomically snapshot the FS and then a power outage kills the server. The repository state will have a small chance of lock files that are never removed. Depending on the lock, future writes to a ref or to the repository can fail.
Not the worst situation as data loss won't occur but definitely not a stable state. If you treat git repositories as a service that needs a recovery step it would be fine; unfortunately most don't.
OK, from your link, it looks like the only problem affecting atomic snapshots is that `.git/index.lock` could be left in existence which will block git from doing any operations until it is deleted. For some reason git will instruct the user to delete this file but will not do it automatically, even though it could actually check if any live processes have the file locked, and assume that, if not, it is stale.
Seems like a bug in git IMO, but still reasonably easy to recover from.
Filesystem snapshots and rsync are among the PostgreSQL backup best practices. Not really for logs, but for stored data. (For logs you don't really need snapshots.)
The other good alternative is atomic backups (like with pg_dump), but that does put some extra load on your database that may be unacceptable.
Yes, you will want something different for your git repositories. There is no backup procedure that is best for all cases.
Unfortunately, this kind of situation, "only the ideal case ever worked at all", is not uncommon. I've seen it before ... when doing things the right way, dotting 'I's and crossing 'T's, requires an experienced employee a good week or two, it's very tempting for a lean startup to bang out something that seems to work in a couple days and move on.
Tom Watson Jr., CEO of IBM between 1956 and 1971, was a key figure in the information revolution. Watson repeatedly demonstrated his abilities as a leader, never more so than in our first short story.
A young executive had made some bad decisions that cost the company several million dollars. He was summoned to Watson’s office, fully expecting to be dismissed. As he entered the office, the young executive said, “I suppose after that set of mistakes you will want to fire me.” Watson was said to have replied,
“Not at all, young man, we have just spent a couple of million dollars educating you.”
This happened to me late one night a few years back, with Oracle on a CentOS server. rm -rf /data/oradata/ on the wrong machine.
I managed to get the data back though, as Oracle was still running and had the files open. "lsof | grep '(deleted)'" and /proc/<ORACLEPIDHERE>/fd/* saved my life. I managed to stop all connections to the database, copy all the (deleted) files into a temp directory, stop Oracle, copy the files to their rightful place, and start up Oracle, with no data lost.
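For anyone who hasn't seen this trick, a tiny Linux-only demo of why it works: as long as some process still holds the file open, the data is reachable through /proc. This sketch uses the shell's own descriptor and /proc/self so it is self-contained; in a real recovery you would go through /proc/<pid>/fd/<n> of the process holding the file, as described above.

```shell
scratch=$(mktemp)
echo "important rows" > "$scratch"

exec 3< "$scratch"     # keep the file open, like a running database would
rm -- "$scratch"       # the name is gone, the data is not

recovered="${scratch}.recovered"
if [ -e /proc/self/fd/3 ]; then
    # cp inherits descriptor 3, so /proc/self/fd/3 refers to the deleted file.
    cp /proc/self/fd/3 "$recovered"
    cat "$recovered"
fi
exec 3<&-              # once the last descriptor closes, the data is really gone
```

Which is also why the first rule after a wrong rm is: do not restart the process that had the files open.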
I haven't seen or done anything of this scale before, but I did have a very sobering moment while working on a large online retailer's stack as a systems engineer.
We were rolling out a new stack in another data center across the country and before replication went live, I decided to connect and check things out. Our chef work hadn't completed for the database hosts, so I decided to install some OS updates by hand using pssh on all the MySQL hosts and saw a kernel update. So I thought, the DC isn't live yet, no replication is running, I'll just restart these servers. So I did, using pssh again, and then I caught a glimpse of the domain in some output and my face went completely pale. I had restarted the production databases... all of them. And they all had 256GB of ECC memory. It takes a very long time for each of those machines to POST.
I contacted the client and said the maintenance page was my fault and was fully expecting to be fired on the spot, but they just grilled me about being careful in the future, and then laughed it off.
I've been the most careful ever since then. It scared me straight. Always make sure you are in the right environment before you do anything that requires a write operation.
Not sure if the doc here is refreshing or scary. But Godspeed GitLab team. I've loved the product for about two years now, so curious to see how this plays out.
I very much appreciate their forthrightness and the way they conduct their company generally. Having said that, I have the code I work on, related content, and a number of clients on the service.
[edit for additional point]
They need the infrastructure guy they've been looking for sooner rather than later. I hope there's good progress on that front.
Note for the person that downvoted my comment: GitLab are fantastic, I'm a big advocate for their product & support. My comment to Sid was in support of helping based on some of the notes I saw in their very transparent report, Sid & I have talked several times in the past & I have quite a bit of PostgreSQL experience - so my comment was a positive one offering support when / if needed, not a negative / piss take if it came off as such.
I was a Sysadmin for a university for a while. During those years I also messed up once. I learnt one thing: If you run a command that cannot be undone, you pause, take a few seconds and think. Is it the right directory? Is it the right server (yes this has happened to me too). Are those the right parameters? Can I simulate the command beforehand?
I introduced this double check for myself, and it has actually caught a few commands I was about to run.
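The "is it the right server" part of that check can be made mechanical. A minimal sketch, assuming you know the hostname you intend to be on; `require_host` is a made-up helper and the hostnames are placeholders.

```shell
# Succeed only if we are on the host we think we are on.
# The second argument overrides the detected hostname (handy for testing).
require_host() {
    expected=$1
    actual=${2:-$(hostname)}
    if [ "$actual" != "$expected" ]; then
        echo "refusing: this is '$actual', expected '$expected'" >&2
        return 1
    fi
}

# Usage: chain the destructive command behind the check, e.g.
#   require_host db2.example.com && <the command you cannot undo>
if ! require_host db2.example.com db1.example.com; then
    echo "wrong host, nothing was run"
fi
```

Having to type the expected hostname out is the pause; the check just enforces it.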
This is why you have lots of copies of your data. Matt Raney has a great talk about designing for failure that includes details on Uber's "worst outage ever." It too involved postgresql replication and mistaking one host for another, but they didn't lose any data because they had more than a dozen live copies of their database.
This isn't an alternative to working backups, of course, but it is an additional safety net. Plus it can give you a lot more options when handling an incident.
Amazing document. Thank you for sharing. Taking it back to my company to make sure we can learn from it and to know what to check (like our logical backups... I know we've seen issues with our 9.5 servers and RHEL7 defaulting to 9.1 or 9.2 on our host where we take the backups from! Verifying exit code here we come...)
@sytse, I noticed you _do_ use streaming WAL replication, but I didn't notice any mention of attempting Point In Time Recovery. Have you taken a look into archiving the WAL files in S3? Those, along with frequent pg_basebackup's (frequent because replaying a WAL file has been painfully slow for us) could allow you to point in time recover to either a timestamp, or a transaction (and before or after). https://www.postgresql.org/docs/9.6/static/continuous-archiv...
We use https://github.com/wal-e/wal-e to manage our uploading to swift (no S3 at our company heh) and then inhouse tooling to build a recovery.conf. Note we actually have our asynchronous followers work off of this system too so they're not taking bandwidth from the primary.
(note this can lead to ~ 1 WAL file of data loss, but is acceptable for us.)
I doubt I could be of any help, since reading the report definitely shows y'all having an up on me with pg knowledge, but if there's anything I can do / talk about feel free to reach out.
I know it's not wal-e, but Barman recently added support for streaming WAL from postgres, so in theory you shouldn't lose any data if the master crashes. Note that this does require a replication slot on the master to implement.
<rant>
It's also stupid that you still have to set up WAL shipping (e.g. via rsync or scp) before taking a base backup even if you have streaming replication enabled.
</rant>
That being said though, I have not been happy with the restore performance of barman, though admittedly this may be I/O related.
I don't think we use PITR, certainly not to S3. I believe wal-e was discussed internally some time in the past, but we never really did anything with it.
Does this mean whatever was in that database is gone, with no available backups?
Is this an SOA where important data might lie in another service or data store, or is this a monolithic app and DB that is responsible for many (or all) things?
What was stored in that database? Does this affect user data? Code?
We have snapshots, but they're not very recent (see the document for more info). The most recent snapshot is roughly 6 hours old (relative to the data loss). The data loss only affects database data, Git repositories and Wikis still exist (though they are fairly useless without a corresponding project).
The doc says that there is a LVM snapshot being 6 hours old. <strike>And there should be a regular logical backup with at most 24 hours age as well (they just can't find it for whatever reason).</strike> (Scratch that, my doc did not update, despite Google saying it should automatically update).
Regarding what's gone: The production PostgreSQL database. This suggests that the code itself is fine, but the mappings to the users are gone. But git is a distributed VCS after all, so all the code should be on the developer's machines as well.
And LVM snapshots aren't exactly the most reliable way to do a filesystem backup either, in terms of filesystem consistency, data consistency or reliability of the snapshot itself. There are even kernel locking bugs when creating and destroying them, plus udev races when it fires off events as changes occur. I stopped using them years ago due to the sheer number of problems.
Has GitLab considered something like ZFS snapshots? They don't have the consistency problems or the space wasting and reliability problems that LVM brings. You could have it take snapshots every few minutes.
If you are not running ZFS NOW you are doing it WRONG. There is NO excuse to use anything else. Stop the barbarism of using and propagating other FS's, they are dinosaurs, broken and their use makes you ill equipped for today let alone the future.
This is painful to read. It's easy to say that they should have tested their backups better, and so on, but there is another lesson here, one that's far more important and easily missed.
When doing something really critical (such as playing with the master database late at night) ALWAYS work with a checklist. Write down WHAT you are going to do, and if possible, talk to a coworker about it so you can vocalize the steps. If there is no coworker, talk to your rubber ducky or stapler on your desk. This will help you catch mistakes. Then when the entire plan looks sensible, go through the steps one by one. Don't deviate from the plan. Don't get distracted and start switching between terminal windows. While making the checklist ask yourself if what you're doing is A) absolutely necessary and B) risks making things worse. Even when the angry emails are piling up you can't allow that pressure to cloud your judgment.
Every startup has moments when last-minute panic-patching of a critical part of the server infrastructure is needed, but if you use a checklist you're not likely to mess up badly, even when tired.
1) Patio11 touches on a very good lesson, in passing, in an article about Japanese business[1]:
While raw programming ability might not be highly valued at many Japanese companies, and engineers are often not in positions of authority, there is nonetheless a commitment to excellence in the practice of engineering. I am an enormously better engineer for having had three years to learn under the more senior engineers at my former employer. We had binders upon binders full of checklists for doing things like e.g. server maintenance, and despite how chafing the process-for-the-sake-of-process sometimes felt, I stole much of it for running my own company. (For example, one simple rule is “One is not allowed to execute commands on production which one has not written into a procedural document, executed on the staging environment, and recorded the expected output of each command into the procedural document, with a defined fallback plan to terminate the procedure if the results of the command do not match expectations.” This feels crazy to a lot of engineers who think “I’ll just SSH in and fix that in a jiffy” and yet that level of care radically reduces the number of self-inflicted outages you’ll have.)
2) I once heard organisational 'red tape' described as 'the scar tissue of process failures' and it is absolutely true and I deeply regret not recording the source of it. Whenever you wonder why there's some tiresome, overly onerous process in place that is slowing you down, consider why it may have been put in place - chances are, there was a process failure that resulted in Bad Things. When you wonder why big orgs are glacially slow compared to more nimble startup competitors, understand that those startups have yet to experience the Bad Things that the big org has probably already endured. Like scar tissue, the processes they develop reduce their agility and performance but also serve to protect the wounds they experienced.
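The rule quoted in point 1 (write each command into a procedure document, record its expected output, stop if reality disagrees) is easy to sketch in code. This is only an illustration under assumptions: `run_procedure` is a made-up name, and "terminate the procedure" stands in for whatever fallback plan the real document would define.

```python
import subprocess


def run_procedure(steps):
    """Run (command, expected_output) pairs in order.

    Each command's stdout is compared with the output recorded in the
    procedure document. On the first mismatch or non-zero exit code the
    procedure stops, so no later step runs against an unexpected state.
    Returns a log of (step number, command, actual output).
    """
    log = []
    for number, (command, expected) in enumerate(steps, start=1):
        result = subprocess.run(command, capture_output=True, text=True)
        actual = result.stdout.strip()
        log.append((number, command, actual))
        if result.returncode != 0 or actual != expected.strip():
            raise RuntimeError(
                f"step {number} did not match the procedure document: "
                f"expected {expected!r}, got {actual!r} "
                f"(exit code {result.returncode}); stopping here"
            )
    return log


# Hypothetical usage: the first step checks which host we are on.
# run_procedure([(["hostname"], "db2.cluster"),
#                (["ls", "/some/data/dir"], "expected listing")])
```

Putting a cheap read-only check (which host, which directory) as the first step means the destructive steps never run if the very first expectation is already wrong.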
One thing that I learned the hard way about "Japanese companies" - despite western conceptions, every company in Japan has its own unique culture (and takes pride in having its own culture!). What's more, departments and divisions inside the same company often work in very different ways.
Why am I saying that? Because some of the Japanese companies I've worked with were the exact opposite of that. To be sure, lip service was duly paid to the aforementioned "commitment to excellence", and every release procedure had its own operational manual, sometimes 300 steps long. Repeated manually for every server. Out of 100-200.
Configuration updates? Sure, let's log in to every server and vi the config file. How do we keep excellence? Just diff with prev and verify (with your eyes, that is) that the result is the same as in your manual. After every "cd" you had to do a pwd and make sure that you moved to the directory you meant to. After every cp you diffed against the original file to make sure.
Releases obviously took all day or often all night, and engineers were stressed and fatigued by the Sisyphean manual with its 300 steps of red tape. They invariably made silly mistakes, because this is what you get when you use human beings as glorified tty+diff. We had release issues and service outages all the time.
We've fortunately managed to move away to modern DevOps practices with a lot of top down effort. But please don't tell me every Japanese company magically delivers top quality. Some of them do, some of them don't, even in the same industry. Insane levels of bureaucracy could be found all across the board, but whether that bureaucracy actually encourages or deters quality is an entirely different story.
Unfortunately I had similar experiences as well; incredibly manual processes, frighteningly long manual procedure descriptions instead of scripted solutions.
My opinion: script it. Always. It doesn't matter if it's ansible, bash, puppet, python, whatever, just make sure it's not an ad-hoc command. Test the script on a server which can be sacrificed. Test as long as there is a single glitch. Run it in production.
It's to eliminate typos and to have a "log" to see what actually had been done.
Oh, absolutely. Where something can be scripted, script it. Why? Because scripting is a process development. You write something, validate it and then remove the human error element.
For things that you can't script, you write abstracted processes that force the executor to write down the things that could cause Bad Things to happen, and use that writing down stage to verify that it's not going to cause a Bad Thing. That forces people to pause and consider what they're doing, which is 80% of the effort towards preventing these issues.
eg: Forcing YP to write down which database they were scorching would've triggered an 'oh fuck' moment. Having a process that dodged naming databases as 'db1' and 'db2' would've prevented it. etc. etc. etc.
Which is what we did, obviously and nobody is allowed to run anything manually while SSHing to production.
But there was a tremendous organic resistance to that from the very same "culture of excellence in engineering". "How can we be sure it works if it's automated?"
"It's safer to manually review the log"
"How can you automate something like email tests or web tests?"
"It's not worth automating this procedure, we only release this app once a year, and it only takes 5 hours".
Expect to hear these kinds of claims when engineers have had the equation "menial work == diligence == excellence" pummeled into them for generations.
Also, script disaster recovery too. Script it when creating your backup procedure (not at the time of disaster), use the script to test your procedure, and do it often.
This way, when your script fails, you can recover quickly.
For the public record, by quoting you I wasn't implying that you agreed with my #2 either. I just felt I gained a lot out of both points, that they both resonated with my experiences, and that they both articulated the lessons I'd learnt in my career.
edit: I split this with my parent reply to try to make the two separate points clearer
Oh no worries at all; I just like continuing to beat my "Japan is a big country with a diversity of practices and attitudes in it" drum. (It is under-beaten both inside and outside of Japan.)
Agreed. I wasn't trying to make the point about "Japanese companies" (and I've edited my other post replying to patio to split the two comments I was making so that they're clearer) but rather about the process aspect. I am sure that Japanese companies, just like Western companies, come in a wide range of competencies. Clearly patio worked for a great one, however, and those lessons apply to companies all over the world. That's why I quoted it.
Definitely. Though I admit I prefer automation or red tape where possible.
What you can say about Japan, is that since technology-wise it tends to be behind the US (of course this too is a gross generalization), you can expect most non-startups to use bureaucracy over automation (and modern DevOps practices in general) to regulate production operation quality. The unfortunate side here is that bureaucracy is much more fragile and when it fails, it tends to fail spectacularly.
> When you wonder why big orgs are glacially slow compared to more nimble startup competitors, understand that those startups have yet to experience the Bad Things that the big org has probably already endured.
Another explanation for big orgs vs teeny start ups: What level of failure is acceptable? For a teeny start up, a few hours being down is not so important. For (say) a bank, being down for a few hours might be mentioned in the national newspapers.
That is certainly true of some of the red tape, but in no way it's true for the majority. A lot of "process" is created because the people in charge of the process need to validate their existence.
Undoubtedly that's a problem, I agree. Competent management helps minimise that, though. And I would not agree that the majority is existence justification. Then again, I'll disclaim that by saying I work in a Mech Eng. role, where process for safety's sake has been established and ingrained into the culture for literal centuries.
A good employment environment is one where you may ask why a process exists and receive valid justifications therein, but where the idea of not following it, no matter how bad, never crosses your mind. I acknowledge I'm really lucky to work in an industry that doesn't fall too far from that target.
If you get the chance to observe pilots operating in the cockpit, I'd recommend it. Every important procedure (even though the pilot has it memorized) is done with a checklist. Important actions are verbally announced and confirmed: "You have the controls" "I have the controls". Much of flight training deals with situational awareness and eliminating distractions in the cockpit. Crew Resource Management[1].
Here's a great documentary [0] by Errol Morris about the United Flight 232 crash in 1989 [1].
"..the accident is considered a prime example of successful crew resource management due to the large number of survivors and the manner in which the flight crew handled the emergency and landed the airplane without conventional control."
Another recent example is Qantas QF32, which had an engine explode (fire, then catastrophic/uncontained turbine failure); the A380 landed with one good engine and two degraded engines. The entire cockpit crew of 5 pilots did a brilliant job in landing the jet.
I kind of agree with the conclusion, but I'd look at that from the opposite direction: i.e. just like frameworks, checklists provide a way to avoid some mistakes in repetitive, boring practices. But you'll never avoid them all, and you're going to need a lot of red tape.
Even better to avoid all that and make it idiot proof. I'd rather not be in the situation where my only protection is rigmarole. But sure, as a last resort (just like frameworks) - much better than nothing. Typically in programming, frameworks are premature. A simple api tends to suffice, with checked (or typechecked) inputs and outputs. But sure, if for some reason you can't make that, and you need complex interactions with dynamically generated code, a variable number of order-dependent parameters, black-box "magic" base-types, stringly-typed unchecked mini-languages, multiple sequentially dependent calls into the same thing, or any other tricky api you can't (or won't) easily detect misuse for... then a framework is the least-bad amongst bad options.
I speak their language (it's German, but people from Switzerland speak a pretty strong dialect) and they discuss highly technical and serious stuff, but their language is just so adorable when they mix the English and the German. I always thought this communication is English only, nowadays?
Every time you create a checklist for developers, it's a mild kind of failure. Human procedures fail too, and we should rely on those as little as possible. Instead of checklists, we should have tested and debugged software.
Now, when you can't have tested and debugged software, yeah, formal procedures are the second best thing. Just don't get complacent there.
I've been lucky to be in a cockpit before and during takeoff. It was an amazing experience to listen to the pilots going through the checklist and agreeing on procedures in case of engine failure etc. Before anything was actually done, they trained me to put on a pilot's mask etc. and to follow procedures if something should happen. Nothing did happen, of course.
Great operations teams have incident response procedures based on checklists and clear communication channels like that as well. If this interests you I recommend David Mytton's talk at dotScale 2015: https://www.youtube.com/watch?v=4qGcTOQRvEU
This[1] chap's story is very relevant: taking those procedures and applying them to the medical industry. I swear I heard this on a podcast, but I'm pretty sure the only thing I'd have heard it on is This American Life, and I can find no mention of it.
Yes, good tip from "Turn the Ship Around" by David Marquet is to use the "I intend to" model. For every action you are going to undertake that is critical, first announce your intentions and give enough time for reactions from others before following through.
I saw that Space Shuttle landing video that was kicking around recently. In that they also had explicit "I agree" responses to any observation like "You're a bit below flight path". Quick, positive acknowledgment of anomalous events or deviations. Seemed really ... sane.
I heard that as "I show you xx". One guy is flying with his head up, the other one talking most is monitoring with his head in the instruments, and helping the pilot flying getting confident data.
Remember they're driving 1970's technology, redundant everything, and they all grew up flying "steam gauges", where the culture includes tapping on the glass to make sure the needle didn't stick. They want to compare every sensor output for sanity so they can disregard one if needed.
This vid is also the source of cockpit audio for the FSim shuttle simulator game if you like this stuff.
Yeah. Oh and Houston did the same thing, calling out the 180 and the 90 on the HAC - heading alignment circle. Just helping them out with their radar indication.
The military uses "intention" rather than "I will" because they understand that no plan survives contact with the enemy. It is also higher level so that when exhausted, stressed subordinates find themselves in life threatening circumstances the most important thing they need to remember is the intent. If they forget steps 1-5 of the plan but recall the intent they can't go that badly wrong in using their initiative.
There's also the distinction between specific orders - "you are to". Knowing the commander's intent, and his commander's intent (known as 1 Up and 2 Up) enables Mission Command, the concept of giving latitude to subordinates to achieve the mission in the best way possible within the confines and direction given to them.
Finally there's the ritual of it, NATO forces expect to operate within multi-national structures where English won't be a first language. There is a "NATO sequence of orders" which should be roughly followed. It means everyone knows the structure of what is coming up and when in the brief - so you don't get people asking questions about equipment when the limitations are being explained, they know that comes later. Opening, especially from a junior officer, with "my intention is" is essentially like having a schema definition at the start of a document - it defines the structure of what is coming for those who are going to be parsing it.
Yup, it's never the fault of a person, always of the system. Once we get this resolved we'll definitely look at ways to prevent anything like it in the future.
The reasoning for that was to prevent reductio ad absurdum and "It's YOUR fault".
"You should have KNOWN that erudite command was going to fail."
"You should have known that our one-off program had issues you did not account for."
"You should have known that the backups were not properly tested elsewhere for known good state."
"You should have ....."
In reality, some disasters were caused by idiotic things like "rm -rf /opt/somedir ." You just hosed the system, or a large part of it quickly. And we could say that your malfeasance of including the "." started wiping / immediately. But we can also say that rm should be aliased to prevent accidents like that, or that rm should do some minimalistic sanity checking on critical directories before executing them.
People can, and will mess up. These computers are nice, in that they can have logic that can self-correct, or at least loudly alert errors.
At some point you have to blame the person. If the person did it wilfully, deliberately, ignoring the checks and safeguards. Or because there's already so much system that having any more of it would place severe restrictions on everyday tasks.
I'm not saying that's the case here, because it does seem that GitLab has systemic deficiencies. But "never" and "always" are such strong statements.
Prosecutor: Mr. Accused, here is evidence that you murdered that other person.
Accused: It's not my fault, but the system's.
Judge: Oh, ok. You are a free man.
The law and a "no fault" postmortem are entirely different things and you conflating the two doesn't help the discussion.
Let's assume a bad actor in a company. It still doesn't help improve the situation to allow blame to rest with the bad actor. Definitely, there should be penalties applied (likely the termination of their position), but it doesn't help your company at all to stop there.
Did they delete data? Why is there no secure backup system in place to recover that data? Why was there such lax security in place to allow them to delete the data in the first place? Why are we hiring people who will go rogue and delete data? Did they "turn" after working here for a while because of toxic culture, processes, etc?
Hell, if the law worked this way, we might actually have less crime because we'd look further into the causes of crime and work to address them instead of simply punishing the offenders.
This sounds good until you consider that many systems are utilizing multiple drives. When someone is expecting to delete a large file and it ends up on a different drive, problems could arise.
Renaming a file should not move its data, right? So rename the file into .file.$(date -Iseconds).trash (but make sure that no legitimate files are ever named in this pattern). Then put that file path into a global /var/trashlist. To cleanup, you just check that file for expired trash and make the final deletion.
Beware race conditions when writing to /var/trashlist (assuming you mean "a file with one path per line.")
Proposed tweaks: symbolic link into /var/trashlist directory, where the name of the symbolic link is "<timestamp>-<random stub>-<original basename>". Timestamp first so we can stop once we hit the first too-recent timestamp, random stub to keep the name unique if two different files in different directories are deleted at the same timestamp, original file name for inspection.
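A rough sketch of the rename-plus-symlink scheme described above, with a few assumptions called out: the timestamp is epoch seconds zero-padded to ten digits (so names sort in time order) rather than `date -Iseconds` output, the function names `trash` and `purge` are invented, and the trash directory is passed in instead of being hard-coded to /var/trashlist.

```python
import os
import secrets
import shutil
import time
from pathlib import Path


def trash(path, trash_dir, now=None):
    """Rename path in place and record it as a symlink in trash_dir."""
    path = Path(os.path.abspath(path))
    stamp = int(time.time() if now is None else now)
    hidden = path.with_name(f".{path.name}.{stamp}.trash")
    # Rename within the same directory: same filesystem, no data is copied.
    os.rename(path, hidden)
    # One symlink per deletion: creating it is atomic, so concurrent
    # deletions do not race on a shared list file.
    link = Path(trash_dir) / f"{stamp:010d}-{secrets.token_hex(4)}-{path.name}"
    os.symlink(hidden, link)
    return link


def purge(trash_dir, max_age_seconds, now=None):
    """Permanently delete trashed items at least max_age_seconds old."""
    now = time.time() if now is None else now
    removed = []
    for link in sorted(Path(trash_dir).iterdir()):
        stamp = int(link.name.split("-", 1)[0])
        if now - stamp < max_age_seconds:
            break  # names sort by timestamp, so the rest are newer
        target = Path(os.readlink(link))
        if target.is_dir() and not target.is_symlink():
            shutil.rmtree(target)
        elif target.exists() or target.is_symlink():
            target.unlink()
        link.unlink()
        removed.append(target)
    return removed
```

Note that this does nothing for the disk-space objection raised elsewhere in the thread: the data still occupies space until `purge` runs.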
Nice, but in practice I've found that when working with replication on big production DBs I usually don't have the space to hold multiple copies of a data dir that I want to delete, and copying elsewhere takes too long. Not a blocker to using it by any means as at least this adds a natural sanity-checking stop.
Which, as mentioned is a systemic problem that has to be solved by training. And/or you can set up cron jobs to do the cleaning. Or have some conditional script triggers.
You're at 90%, and you have an app error spewing a few MB of logs per minute. Your on-call engineer /bin/rm's the logs, and instead of going to 30%, you're still at 91%, only the files are gone. Your engineer (rightfully) thinks "maybe the files are still on disk, and there's a filehandle holding the space", so instead of checking /proc to confirm, he bounces the service. Disk stays full, but you've incurred a few seconds of app downtime for no reason, and your engineer is still confused as shit. Your cron job won't kick in for hours? Days? In the mean time, you're still running out of disk, and sooner or later you'll hit 100% and have a real outage.
Cron job is a stupid hack. It doesn't solve any problems that aren't better solved a dozen other ways.
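For the "check /proc to confirm" step in that scenario, here is a small sketch, assuming Linux, where the link target of an open descriptor for an unlinked file ends in " (deleted)". The function name is made up, and without root it only sees processes you are allowed to inspect.

```python
import os
from pathlib import Path

DELETED_SUFFIX = " (deleted)"


def deleted_open_files(proc_root="/proc"):
    """Return (pid, path) pairs for open descriptors of unlinked files.

    Space used by such files is only released once the owning process
    closes the descriptor (or exits).
    """
    found = []
    for proc in Path(proc_root).iterdir():
        if not proc.name.isdigit():
            continue  # not a process directory
        try:
            descriptors = list((proc / "fd").iterdir())
        except OSError:
            continue  # process exited, or we lack permission
        for fd in descriptors:
            try:
                target = os.readlink(fd)
            except OSError:
                continue  # descriptor was closed in the meantime
            if target.endswith(DELETED_SUFFIX):
                found.append((int(proc.name), target[: -len(DELETED_SUFFIX)]))
    return found


# Hypothetical usage:
# for pid, path in deleted_open_files():
#     print(pid, path)
```

This answers "is a filehandle holding the space?" without walking the whole filesystem and without bouncing anything.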
That engineer should have read the documentation. Failing that a `du -a / | sort -n -r` would immediately show the conclusion you jumped to was wrong. Randomly bouncing services on a production machine is a cowboy move.
No documentation, no checklists? That's the source of your problem, not a trash command which moves files rather than deleting them.
Changing the meaning of a decades old command is the problem - du / can take hours to run on some systems, and randomly bouncing shit in prod is the type of cowboy shit that happens in a panic (see gitlab live doc).
Docs and checklists are fine, but at 2am when the jr is on call, you're asking for problems by making rm == mv
Maybe add a bash function to check the path and ask the magic question: "Do you really want to delete the dir XYZ in root@domain.com?" .... but then again when you're in panic mode you might either misread the host or hit 'y' without really reading what's in front of you.
The best thing to do is to never operate with 2 terminals simultaneously, when one of them is a production env, better login/logout or at least minimise it.
The problem in this case was that YP confused the host. YP thought he was operating in db2 (the host that went out of sync) and not db1 (the host that held the data), a message that doesn't display the current host wouldn't help in this case.
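One way to make the prompt useful even when the operator is wrong about where they are: have the script carry the host the plan is meant for, and compare that with the machine's actual hostname before asking anything. A minimal sketch, with `confirm_host` as a made-up name and db1/db2 used only as in the discussion above:

```python
import socket


def confirm_host(expected_host, typed, actual_host=None):
    """Allow a destructive step only if the plan, the machine and the
    operator all agree on the hostname.

    expected_host: the host written into the plan or script.
    typed: what the operator typed when asked to name the host.
    actual_host: override for testing; defaults to this machine's name.
    """
    actual = socket.gethostname() if actual_host is None else actual_host
    if actual != expected_host:
        raise RuntimeError(
            f"refusing to run: this is {actual!r}, "
            f"but the plan targets {expected_host!r}"
        )
    if typed.strip() != actual:
        raise RuntimeError("typed hostname does not match; aborting")
    return True


# Hypothetical usage before a destructive step:
# confirm_host("db2.cluster", input("Type the hostname to continue: "))
```

Typing the name is harder to do on autopilot than hitting 'y', and the first check does not depend on the operator reading anything correctly.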
Indeed. That's happened on our systems as well. Someone issued a reboot on what they thought was a non-critical host, and instead they did it on some very essential host. The host came up in a bad state, and critical services did not start (kerberos....).
It led to a "Very Bad Day". I found out about it after reading the post mortem.
One day, I almost accidentally shut off our master DB so I could update it. It's funny.. I leave these types of tasks for later in the day because I'm going to be tired and they're easy & quick tasks. But that almost backfired on me that day; I read the hostname a few times before it fully hit me that I wasn't where I was supposed to be.
On some machines I've actually masked `rm` with an echo to remind me not to use it, and I would delete with `rrmm`. That would give me pause to ensure that I really mean to remove what I'm removing, and more importantly, that I didn't type `rm` when I meant `mv` (which I actually have done by accident).
We have 2 people on rotation and anything like this you pair with someone else and talk through it. As we like to say "It's only a mistake if we both make it."
Absolutely. Same goes for any ops response. You need two people: one to triage the issue and another one to communicate with external stakeholders and to help the one doing the triage.
The military does a very similar thing. An Army company commander usually has an RTO (radiotelephone operator) to handle talking on the radio. This frees the commander to make real-time decisions and respond quickly to the situation on the ground, and frees him/her from having to spend time explaining things. A really good RTO will function as the voice of the commander, anticipating what he/she needs to get from the person on the other end of the radio. This is a great characteristic of a good operations engineer, too. While the person doing triage is addressing the problem, the communicator is roping in other resources that might be needed and communicating the current situation out to the rest of the team/company.
Another thing to note is that the RTO will occasionally state someone and "ACTUAL". That means that whoever they are speaking for is actually speaking, and not the RTO on behalf of the CO.
The best Ops people I have worked with (looking at you Dennis and Alan) repeat everything back that I say. More than once I have caught mistakes in my approach simply by hearing someone else repeat back exactly what I just said.
BBC's Horizon has a really good episode about checklists and how they're used to prevent mistakes in hospitals, and how they're being adopted in other environments in light of that success. It's called How To Avoid Mistakes In Surgery for the interested.
There are very, very few situations in life where there's not enough time to take a time-out, sitrep, or checklist.
I work in EMS when not in IT, and even bringing a trauma or cardiac arrest patient into the Emergency Room, there is still time to review and consider.
If there's time in Emergency Medicine, there's time in IT.
I worked in software to help manage this for a while. There are still checklists but they are produced ahead of schedule. Every instrument a nurse takes off the tray is counted and then checked at the end for instance.
Delivery room experience: before stitching my wife, the OB counted the pieces of gauze out loud with the nurse watching. They verbally confirmed the total with each other. A matching count and verbal confirmation were performed after the stitching. It inspired confidence seeing them perform this protocol.
With gauze in particular I think every nurse has a story of "that time we removed the septic gauze" with colorful descriptions of the accompanying smell.
Something related, I made a script that helps me to clean up my git repository of already-merged branches (I tend to not delete them until after a release cycle).
In this script I added a checklist of things to "check" before running it. It has worked in my favour every time I run it.
Another lesson is this one which I also learned the hard way: don't work long hours or late at night unless it is absolutely necessary. It doesn't sound like it was totally necessary for YP to be pushing this hard, and pushing that hard is what leads to these kinds of errors.
> Don't get distracted and start switching between terminal windows.
I got bitten by this in the past, luckily nothing that could not be reversed, just 2-3 hours lost. I can imagine how YP's stomach must have felt when he realised what happened.
Still, I had no idea about checklists and so many people here seem to be pretty familiar with the concept :-)
Checklists are great if you have them and often you should create one even during an outage but other times you waste time. It's really hard to balance fixing things quickly and not breaking more.
There isn't a silver bullet anyway; it's layer upon layer of operational best practices that makes you resilient against such issues.
First of all, I want to say: I think the GitLab people are incredible. They're open and transparent about everything they do, and I think that in a world dominated by GitHub, the fact that GitLab exists is hugely important to decentralization and the continued march of open source. This document continues to demonstrate how the GitLab people are great, transparent people.
That being said, this is why you shouldn't entrust a cloud service to keep your data safe: "So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place."
My backups work. I know they work, because I run them and I test them. People entrusting cloud services to have good backups cannot say that.
> At this point frustration begins to kick in. Earlier this night YP explicitly mentioned he was going to sign off as it was getting late (23:00 or so local time), but didn’t due to the replication problems popping up all of a sudden.
This is why I'm not a fan of emergency pager duty.
I'm commenting a bit late, but I hope it's still read by the gitlab team.
First, you kept your heads and didn't turn on each other. That's a major success, and gives me more confidence in gitlab. The rest you can improve on only if you have this right.
Second, I'm sure you've gotten the message to test your backups and recovery plan. It's a good time to read the Google SRE book, and consider how to put together full integration tests that build up a db, back it up, nuke the original, and recover from the db. With containers this isn't actually awful to do.
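As a toy illustration of that build / back up / nuke / recover loop, here is a sketch that uses SQLite from the Python standard library as a stand-in. That substitution is an assumption made purely to keep the example self-contained; a real test would run the actual PostgreSQL backup and restore tooling, for instance inside containers as suggested. Table and file names are invented.

```python
import os
import sqlite3


def backup_restore_roundtrip(workdir):
    """Build a database, back it up, destroy the original, restore it,
    and return (rows before, rows after restore) for comparison."""
    live = os.path.join(workdir, "live.db")
    backup = os.path.join(workdir, "backup.db")
    restored = os.path.join(workdir, "restored.db")
    query = "SELECT id, name FROM projects ORDER BY id"

    # 1. Build a database with known contents.
    conn = sqlite3.connect(live)
    conn.execute("CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO projects (name) VALUES (?)",
                     [("alpha",), ("beta",)])
    conn.commit()
    expected = conn.execute(query).fetchall()

    # 2. Back it up.
    dest = sqlite3.connect(backup)
    conn.backup(dest)
    dest.close()
    conn.close()

    # 3. Nuke the original.
    os.remove(live)

    # 4. Recover into a fresh location using only the backup.
    src = sqlite3.connect(backup)
    out = sqlite3.connect(restored)
    src.backup(out)
    src.close()
    actual = out.execute(query).fetchall()
    out.close()
    return expected, actual
```

The useful property is that the final comparison runs against data that only exists if the backup was real and the restore procedure works, which is exactly what a backup job reporting "success" does not tell you.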
But I didn't see much mentioned about load tests. A few simple scripts that hit your server (or test instance!) hard can help you find points where things fall apart under load. Even if you don't have a good way to gracefully do anything else other than alert a human, you can figure out what to monitor and how to make sure your backup/recovery plans can deal with a shit-ton of spammer data suddenly in your DB.
Thank you for your feedback. I will add your suggestions to our document!
We will be implementing new policies on backups and you can totally expect us doing load tests in the future. The whole team wants to make sure this will not happen again.
Yes, doing this would make it easier to use services like GitHub. For example, DataKit's GitHub bridge service uses webhooks to track changes in GitHub metadata (e.g. the status of branches, tags and PRs) and stores the state in another Git branch. The DataKitCI monitors this branch to know when to build things (and writes the status back there):
I just checked git-appraise and while it looks rather mature, it uses (as far as I can see by now) an approach which is not that nice if you have multiple public remotes. Also, as far as I can see, each submitter must be able to push to that remote - please correct me if I'm wrong.
We have a different approach for this, which is more powerful (talking about how the data is stored).
Of course, our tool is not mature yet. Maybe gitlab can learn from what we've researched...
> if they would have had their PRs and Issues in git repositories, too
Well, perhaps a git repository isn't the right persistency structure for non-code items like issues and pull requests. That a git repository is a good structure for document-like entities (like code files) and tracking changes/versions, doesn't mean that it's a good choice for highly-relational or highly-dynamic objects.
They already do this. Both db1.cluster and db2.cluster are production machines; their staging equivalents are db1.staging and db2.staging. The confusion was between two production instances--one of which had the latest data, and the other was replicating. The intent was to delete the partially replicated data but the command was run on the current master database server instead.
We're using the Gitlab CE for our internal use, but one of our customers is on Gitlab.com. We're likely still going to recommend Gitlab.com after this, but also enable mirroring to our internal instance.
As a side note, I just checked our S3 Gitlab backup bucket and it does have backups for every day for the last year (1.8 GB each, yikes), so instead of failing to create the backups, it's actually failing to delete the older ones! :)
Gitlab's automagic backup cleaner specifically does not work for S3 backups. To automatically clean up your s3 bucket use S3's built in lifecycle settings.
Wow, very intriguing to read. I appreciate honest event descriptions like this.
This has reminded me how important it is to perform regular rehearsals of data recovery scenarios. I'd much rather find these failures in a practice run. Thanks to GitLab for continuing to openly share their experience.
I'm going to add to your message box unnecessarily, but I want to say I love GitLab and it's a shining example of a transparent company. I still have ambitions to work there someday, and this event is hopefully a net gain in the end, in that everyone here and there learns about backups.
Thanks for the transparency. Doesn't always feel good to have missteps aired in public, but it makes us all a little better as a community to be clear about where mistakes can be made.
I've been there myself, it was at the start of my career and almost ended it. I know how incredibly emotional this kind of thing can be. Just understand you aren't the first, you won't be the last and shit happens. If you are ever in Philly I'll buy you a beer if you drink, and a dinner if you don't.
Was just trying to push earlier today and found out about the issue. Sorry man! Drinks on me in Montevideo, Uruguay. This stuff happens, more than most of us are willing to accept, so here's to your transparency and, you know, fix it, learn from it and on you go!
Dude, we've all been there. You are neither the first nor the last. It's never a single person. One day the technical side of our little startup collectively went to a conference (back then it was only two engineers plus a technical leader) while our server DoS'd itself by broken mail processing... we had a rough night in a Belgium hotel figuring out who attacked the site to realize it was ourselves. The 10k block of missing image IDs always stood as a reference not to leave even a low traffic site unmonitored. It happens.
Hey man, don't beat yourself up over this. It's shitty but you found some flaws in the process, in the setup, and y'all can make things better because of that.
Yorick! Thank you for the transparency, I know how tough incidents like these can be. Stop the bleeding, figure out how to handle this better in the future, but most of all, take care!
As much as I appreciate GitLab's extreme openness, that's maybe something that by policy shouldn't be part of published reports. Internal process is one thing, but if something goes really bad, customers might not be so good at "blameless postmortems" if they have a name to blame.
It seems to me that, as a customer, it is blame-shifting away from the company to a particular person. Blameless post-mortems are great, but when speaking to people outside the company I think it is important to own it collectively, "after a second or two we notice we ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com." I believe this isn't your intention, but that is how I interpreted it.
In our postmortems we explicitly avoid referring to names and only refer to "engineers" or specific teams. There is no reason to refer to specific names if your intention is a systems/process fix.
To me those "Engineers" read as faceless replaceable cogs. These initials make it personal, it's better, we can now say "YP", that's exactly you, hey, chin up. Sounds better than "engineering team 42".
You write the CEO's name on all your publications, of course always taking credit/glory, so why not let engineers do the same and take credit/ownership both when doing nice commits and when fucking up? We're all people first, and prefer to speak/talk to people and not Engineering Team MailBox at Enterprise Corporation.
Not so long ago GitLab decided to move from using AWS Cloud to managing their own hardware. I wonder if such a situation could happen if they used managed Postgres with automatic backups. Most of us use the cloud because ops is hard, and human-related risks are too high.
The way you are open about this and do not blame the engineer who did "rm -Rvf" (he knows he fucked up, suffers enough already), and seeing that improvements can be made and you're willing to do them.
Applying for work with you now, and moving all my stuff to GitLab.
More than loving your service we like you guys as people and trust you to get things straightened out. I wouldn't host my code with anyone else despite this.
You know this actually makes me want to try out gitlab. They are super transparent and the fact that they ran into this issue now means they will just be better off in the long run. Does it suck? Sure. But this is why you backup your own data anyway.
> 2017/01/31 23:00-ish YP thinks that perhaps pg_basebackup is being super pedantic about there being an empty data directory, decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com
A long, frustrating day. Running destructive commands at 11pm. This is why pilots have duty time limits. YP should've been relieved by another engineer who was physically and mentally fresher. Human beings have limits, and when we reach them, we make more—and worse—mistakes. Any process that fails to account for this is broken.
"The replication procedure is super fragile, prone to error, relies on a handful of random shell scripts, and is badly documented"
This is true of many databases, but in my experience is particularly true of postgres. It's a marvelous single-instance product, but I've never really found any replication/ha tech for it that I've been that happy with. I've always been a bit nervous about postgres-backed products for this very reason.
I'd be interested what other people's take on this is, though.
I've been a DBA for thousands of MySQL hosts in production and just a handful of PostgreSQL clusters, so I'm not a PostgreSQL expert by any stretch. These PostgreSQL machines were also inherited from a company we acquired, so some of my negative impressions can likely be explained by them being set up poorly, which sadly seems to be the norm.
PostgreSQL's replication has evolved slowly over the years, and as a result it's idiosyncratic, complicated to get right, and hard to generalize about.
Postgres replication works by shipping WAL (transaction/redo) logs to slaves, and the slaves are in a constant state of DB recovery. Streaming replication does so as writes happen, and file based log shipping copies the logs when the 16MB segment is complete. Ours were set up to use streaming replication, without having a separate file-based archive log host to use to fetch older logs. Whenever the write load became too much and the slaves could no longer keep up with the master WAL logs on the master would expire and replication would break entirely, requiring you to rebuild all your slaves from backup. This was my introduction to PG replication. Apparently PG 9.4 has 'replication slots' or something that make this less of a headache.
WAL logs were also applied without checksums or verification which corrupted the entire replication chain, and as I remember we didn't discover this until the corrupt pages got hit by a query.
Software upgrades with PG seemed to be a huge pain as well, we could never figure out how long they were going to take, and even if the on-disk formats didn't change, somehow all the cardinality stats on our tables disappeared, and all our query plans went to shit until we ran some process to rebuild them, which took days.
Operationally there were some things that made our team angry. We couldn't figure out how to reparent a slave without completely re-cloning it from the new parent, even though it was completely up to date from the authoritative master at the time of the reparenting. Also, do you really have to take down the entire cluster to change max connections on a slave? Many such settings seemed to not be dynamic and must remain in sync across the entire chain. One of the things people seem to love about PG is how correct and proper it is respecting the sanctity of your data. As an ops person I'd much rather deal with slightly inconsistent replicas (common in MySQL) than have to fight with how rigid PG is.
If we're talking normal replication then I can tell you for certain that the built in replication system is completely rock solid (especially compared to MySQL).
If you're talking multi-master replication, then Citus is pretty solid, but not nearly as solid as replication.
If you're talking statement based replication, well, it's the same as all databases. Here be dragons.
The built-in replication may be good - but it's pretty new. Replication slots, which are IMO vital for making the built-in replication non-fragile, only arrived in 9.4, which is pretty recent of a release. I wonder how widely tested it is, given that? How many people are actually using it for large workloads?
Citus I hadn't heard of though - that's interesting, thanks.
Two years do surely not confer the status of "battle-tested" but I wouldn't call it "recent" either. Then again, DB service level and a standard application service level might differ by a few nines/sigma here.
Okay, it's less recent than I remembered (how time flies!).
Still, users are typically fairly slow to update their DB software, so I suspect 9.4 is still a pretty small percentage of the installed base at this point.
Any replication / backup / restore process that is not regularly tested to verify that it works is fragile. Even more so when it's a series of manual steps to stonith, promote, etc. There's nothing particularly fragile about Postgres and there's a wealth of backup options ranging from logical (pg_dump), to replication (built-in WAL based), to physical archiving (pg_basebackup + WAL archiving). There are a number of cloud DBaaS vendors that run huge fleets of Postgres databases (Heroku and Amazon RDS come to mind). This is a solved problem.
The root of this clusterfuck for GitLab is that they didn't test anything. Looks like there was some cargo culting of "Yah we need backups so let's run something ... yada yada S3 ..." but no verification of anything actually being backed up, or error alerts for missed backups, let alone testing a full restoration from said backups.
My team and I switched from bitbucket to gitlab a few months ago and we love the transition. Gitlab provides a lot of value to me and my team who are learning how to code while working on side projects. Although we cannot send merge requests today because of this issue, we are all cheering them on. I’m very happy that they are so transparent about their issues because my team and I are learning so much from their report and insights here on HN comments. Good luck!
I like gitlab's issue tracking. It's faster (from a productivity standpoint) and easier to manage compared to Jira + Bucket, which feels a bit too bloated.
>2. Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.
>3. SH: It looks like pg_dump may be failing because PostgreSQL 9.2 binaries are being run instead of 9.6 binaries. This happens because omnibus only uses Pg 9.6 if data/PG_VERSION is set to 9.6, but on workers this file does not exist. As a result it defaults to 9.2, failing silently. No SQL dumps were made as a result. Fog gem may have cleaned out older backups.
>5. Our backups to S3 apparently don’t work either: the bucket is empty
I think we've all seen that with some kind of report or backup that's regularly generated: an empty file or none at all is produced due to some silent error.
I highly recommend creating a monitoring check for file size for each automatically generated file.
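A minimal sketch of such a check, assuming you know a sane lower bound for each generated file. The function name and the threshold are made up for illustration; wire the result into whatever monitoring you already have.

```python
import os


def check_generated_file(path, min_bytes):
    """Return (ok, message); a missing or suspiciously small file counts as a failure."""
    if not os.path.exists(path):
        return False, f"{path}: missing"
    size = os.path.getsize(path)
    if size < min_bytes:
        return False, f"{path}: only {size} bytes, expected at least {min_bytes}"
    return True, f"{path}: {size} bytes"
```

A dump that is "only a few bytes in size", as in the report, fails this check immediately instead of sitting unnoticed.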
At $work, we also generate quite a few config files (for Radius, DHCP servers, web servers, mail servers....). For those we have a mechanism that diffs the old and the new version, and rejects the new version if the diff exceeds a pre-defined percentage of the file size, escalating the decision to a human.
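Something along these lines, as a sketch only: it measures the changed share by lines using `difflib` from the standard library, which is an approximation of "percentage of the file size", and the 20% default threshold is an arbitrary assumption.

```python
import difflib


def change_ratio(old_text, new_text):
    """Return the share of the two files that differs, between 0.0 and 1.0, compared by line."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    return 1.0 - matcher.ratio()


def accept_new_config(old_text, new_text, max_change=0.2):
    """Accept the new version automatically only if the change is small enough."""
    return change_ratio(old_text, new_text) <= max_change
```

When `accept_new_config` returns `False`, the old file stays in place and a human gets to decide. A generator that silently produces an empty file is rejected because the change ratio is 1.0.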
Thank you GitLab people for writing a log!
It's just as important as the service you are providing.
Keep up the good job and never be afraid to talk about your mistakes.
It would be also very educational if you could try to do a "5 whys" session and share it too.
The person who made the mistake deserves a bit of rest, that's for sure.
I wish she or he is supported by the team emotionally and not just being blamed.
> Work is interrupted due to this, and due to spam/high load on GitLab.com
This is why spam should be illegal. The advertisers, the ISPs harboring them or their country should be taken on for damages. This not only prevented Gitlab from doing business, but also people who depend on them from doing business. It's criminal.
I'm curious to know what strategy has been developed out of this regarding delivery of spam through creation of snippets. In the original GitLab First Incident report it noted "spammers were hammering the database by creating snippets, making it unstable". So many easily accessible platforms are out there that this method of spamming could be used on that it seems like a necessity to evaluate current workflows and identify where checks/balances can be inserted that would prevent this from happening again. Short of removal of snippets, there must be some method of snippet grepping that would put a pause on suspicious snippets, preventing the bulk of submissions along the lines of what GitLab initially received.
I'm another happy GitLab user, but things like this always kinda freak me out.
Do any of you use any repo-mirroring strategy? Something a little more automated than pushing to and maintaining separate remotes? For example, would it be worth it to spin up a self-hosted GitLab instance and then script nightly pulls from GitLab.com?
So, maybe a stupid question from a frontend dev who doesn't deal with these systems at all, but aren't these systems usually part of a cluster with read replicas? Blowing away the contents of one box shouldn't destroy the cluster right? I thought the primary/secondary pattern was really common among relational databases and failover boxes and other measures were standard practice. Was the command executed on all machines? Is the cluster treated as one file system? Please excuse the ignorance.
You'd be surprised how many companies out there do not run read replicas and don't really have anything for disaster recovery (same data center, same country, good luck when there's an outage).
I am :/ ... currently maintaining service for 500 high-volume businesses 24x7x365 in 9 timezones. Luckily the product and infrastructure is pretty stable and problems occur maybe once a quarter.
But the constant nagging in the back of your head that shit can go wrong at any second is draining and has been the biggest stressor in my life for a long time now.
My S.O. still gets mildly upset when I pack up the laptop on our way out to a fancy dinner, or disappear with my laptop when visiting her parents, but the fact that our life goals are aligned is the saving grace of all these situations. We both know what we want out of the next 5 years of our lives and are willing to sacrifice to achieve this goal (long term financial security).
First off, my heartfelt commiserations for the GitLab team here. My suggestion: Start watching an hour-long video; the rsync will finish right when it gets to the good part!
I wonder if a future project might be to have the DB-stored stuff use Git as a replication back-end. Like, for example, having each issue be a directory, and individual comments be JSON files. It would never (normally) be the data store "of record" (the DB would), but maybe that would work as a backup?
Gerrit is working on just this. It uses the `refs/meta/config` branch for project configuration and is moving its database dependencies into a git database. Reviews are stored in `refs/changes/*`. Backing up a project and verifying its integrity is as simple as `git clone --mirror`.
The best system administrator is the one that has learned from their catastrophic fuck up.
To that effect, I still have the same job as I did before I ran "yum update" without knowing it attempts to do in place kernel upgrades. Which resulted in a corrupted RedHat installation on a server we could not turn off.
There is learning from a catastrophic fuck up, and then there is incompetence. Backups are like Day 1, SysAdmin 101. I can't quite grasp how so many different backup systems were left unchecked. Every morning I receive messages saying everything is fine, yet I still go into the backup systems to make sure they actually did run, in case there was an issue with the system alerting me.
> There is learning from a catastrophic fuck up, and then there is incompetence.
We all start at incompetence, but eventually we — wait for it — learn from our experiences. Would you believe that Caesar, Michael Jordan and Steve Wozniak once were so incompetent that they couldn't even control their bowels or tie their shoes? They learned.
Is it possible that the guys in the team running GitLab's operations were misplaced? Certainly — that's a management issue. And I can guar-an-tee you that GitLab now has a team of ops guys who viscerally understand the need for good backups: they'd be insane to disperse that team to the winds.
There's no excuse for backups not being set up, period, especially for such a high-profile site and given the rigorous hiring circus they put candidates through. This doesn't fall under "a learning experience". I wish them luck, but this is just gross negligence.
Funny that I haven't seen the following anecdote yet: One is none, two is one (== none). AFAICT there were only 2 Postgres instances? What gives? How would you ever feel comfortable when one goes down?
How we deal with recovery:
- run DB servers on ZFS
- built a tool to orchestrate snapshotting (every 15 minutes) using an external Mutex to distribute snapshot creation for best recover accuracy. You could also have increased retention over time like:
Recover: choose point in time closest to fckup, the tool automatically elects the DB with closest (earlier than given time) snapshot. All other slaves are restored before that point in time and roll forward to the active state of new "master".
Instead of executing worst case recovery plans by copying data to at least 6 (minimal) db read slaves we can recover in minutes with minimal data loss (especially when you consider downtime == data loss).
There are cases where a setup like this would be a no go (think of companies where having lost transactions are absolutely devastating) but I don't think Gitlab is one of those.
Side effect of ZFS is being able to ship blocks of data as offsite backups (instead of dumping), able to `zpool import` anywhere, checksumming, compression etc etc..
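To make the "Recover" step above concrete, here is a sketch of the election logic only, not our actual tool: given the snapshot times per host, pick the host whose newest snapshot is still earlier than the chosen point in time. Host names and the data layout are assumptions for illustration, and timestamps can be anything comparable.

```python
from bisect import bisect_left


def latest_snapshot_before(snapshots, target):
    """Return the newest snapshot time strictly earlier than target, or None if there is none."""
    ordered = sorted(snapshots)
    index = bisect_left(ordered, target)
    return ordered[index - 1] if index > 0 else None


def elect_restore_source(snapshots_by_host, target):
    """Return (host, snapshot_time) for the snapshot closest before target, or None."""
    best = None
    for host, snapshots in snapshots_by_host.items():
        candidate = latest_snapshot_before(snapshots, target)
        if candidate is not None and (best is None or candidate > best[1]):
            best = (host, candidate)
    return best
```

Because snapshot creation is staggered across hosts, the elected snapshot is usually much closer to the fckup than the 15-minute interval on any single machine would suggest.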
> It's backup day today so I'm pissed off. Being the BOFH, however, does have it's advantages. I reassign null to be the tape device - it's so much more economical on my time as I don't have to keep getting up to change tapes every 5 minutes. And it speeds up backups too, so it can't be all bad can it? Of course not. --bofh, episode #1
I really feel for everyone involved. Knowing Gitlab, they'll learn and become better for it.
I've been using the PS1 trick they mention for the last couple of years and I've found it to be a really good visual check (red=prod, yellow=staging, green=dev). We then also apply the colorscheme to the header in our admin pages too. Those of us that are jumping between environments are a big risk to data :-)
a user for using a repository as some form of CDN, resulting in 47 000 IPs signing in using the same account
I'd be interested in how this occurs. Simply linking a raw file in a repository would surely not require a sign in. Did someone come up with some way of automatically using credentials on a download link?
47 000 simultaneous users suggests that wouldn't be a small project that did so.
As a complete guess, something like using sessions persisted back to the PostgreSQL database, without something like memcached in front of it.
With that kind of approach it could be trying to update a session table (with new IP address?) for literally every page load by the 47,000 people. Which would probably suck. ;)
Backups sucked starting in 8.15 on our instances of GLE, because someone decided to add a "readable" date stamp in addition to the unix timestamp in the backup file name without proper testing, which caused many issues. It was somewhat fixed, but I do still see issues in 8.16.
I'm not complaining, but backup/restore is an important part and deserves 100% test coverage and daily backup/restore runs.
Yes, when I was responsible for databases and servers we had a 'red alert' file where all the worst-case scenarios were described, with recovery procedures that were tested every half year or so. This came about after these scenarios happened one after the other and I had to fix them manually.
One of them was a hard disk crash where a tape backup failed for some reason, and the tape from two days ago was too outdated. I did a filesystem recovery, then a database recovery and consistency check, mounted the files and looked at which tables were corrupted and restored those from backup. I didn't want to go through this (with a huge time pressure) ever again.
After that we decided to check if all backups succeeded and were consistent so everyone on the team could restore within an hour.
At my $currentJob, we run 30TB worth of database restores every day in automated processes.
We do this because the DBA insisted that the DB backup process was fine. We tried to restore 3 backups as a test, and they all failed. We no longer have DBAs. We have automated procedures and very thorough testing. Zero failed restorations since then.
I think the fact that no one seems to even remember that GitHub had a similar incident in 2010 https://github.com/blog/744-today-s-outage is a good thing to keep in mind for the GitLab team. This too shall pass etc.
YP thinks that perhaps pg_basebackup is being super pedantic about there being an empty data directory, decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com
My heart honestly goes out to YP. This is a terrible sysops _oh shit_ moment.
The recovery process seems very slow with ~50mbit/sec. Could that be an issue related to cloud providers? I heard that issue quite often when dealing with AWS/Azure. Even HDDs should have much higher throughput for that kind of transfer.
If they had dedicated hardware in 2 datacentres on the same continent, copying between those servers should easily be possible at 250mbit/s or more (from my experience). Especially as they seem to copy at the US east coast, where it's now night.
For me, that would be a serious issue dealing with cloud providers. If I have a server with a 250mbit connection, I expect to be able to copy data between datacentres at that speed. And I never had problems with OVH, Hetzner and the like.
Well...never delete. Rename the directory someplace else so you can get it back if the deletion was a bad plan.
Also helpful to make the window background color different, or some other highly conspicuous visual difference when working on several very similar production machines.
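The rename-instead-of-delete habit is easy to wrap in a small helper. A sketch, with a made-up naming scheme for the set-aside copy:

```python
import os
import time


def move_aside(path):
    """Rename path to a timestamped sibling instead of deleting it; return the new name."""
    path = os.path.normpath(path)
    aside = f"{path}.removed-{time.strftime('%Y%m%dT%H%M%S')}"
    os.rename(path, aside)
    return aside
```

On the same filesystem the rename is instant regardless of directory size, and undoing it is just another rename. You still need to clean up the set-aside copies once you are sure, and you need enough free disk to keep them around in the meantime.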
I've seen three backup methods fail when it came time for an emergency restore, due to lack of competence, confusion and lack of regular restore tests.
I have at least 10 private repos on GitLab, and many public ones. Even so, this is no big deal to me. That's the beauty of git. Even if all of their backups fail, I can just do a push and everything is back up there.
I just hope my laptop doesn't die before they get it back online.
EDIT: Was fun to put this little command together. Run this from your code directory, and it will push all of your gitlab repos. I'm going to run it when GitLab is back online.
> Git repositories are NOT lost, we can recreate all of the projects whose user/group existed before the data loss, but we cannot restore any of these projects issues, etc.
Your fancy snippet will report that it has pushed no changes. The data that was lost was new issues, PRs, issue comments, and so on; I've never heard of anyone keeping backups of these on their local laptops.
> I've never heard of anyone keeping backups of these on their local laptops.
Hmm... That's an interesting idea!
You could do that on a separate (empty) branch. Maybe call it `__project`, and you could just have folders of markdown files. You could have two root folders for `issues/` and `pull_requests/`, and two subfolders in each for `./open/` and `./closed/`. And a simple command-line tool + web UI. You could just edit the file to add a comment.
It would be really nice to have a history and backup of all of your issues. I also like the fact that you could create or edit issues offline.
Then you could also set up a 2-way sync between your repo and GitLab / GitHub / Trello.
That sort of "inline" issue tracking is a thing. I think Bugs Everywhere[1] is one of the more mature systems based on the idea. There are several others too[2], most of them unmaintained. There are also wiki-style systems based on the same idea.
I like to always try to automate stuff as much as possible to remove human error. It's easy to forget an arg or other parameter on a script, or even to forget what the arg was for in the first place. Sounds like their last backup was 24 hrs ago. Having backups is like having good security: you don't realize how important they are until it's too late. Reminds me of this old meme:
What is gitlab storing in their database? From what I understand, the repos were untouched by the DB problems, so what is taking up a third of a terabyte of DB space?
As a small team made up of 3 companies that work closely together, we all use GitLab's services daily for our work. Thank you for the great service and we wish you a speedy and painless recovery!
I had to restore a rails app from a day-old backup once. I actually managed to bring back the last day's data as well by parsing POST/PUT/PATCH lines in the rails log. This is painful, and you have to keep track of new ids for relations, but it "works" (obviously, there is info you can't retrieve that way, but in those situations, anything more than nothing is good).
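For the curious, the first pass of that kind of log archaeology looks roughly like this. It is a sketch that assumes the default Rails request line format (`Started POST "/path" for <ip> at <timestamp>`) and only pulls out the write requests; recovering the actual data also means parsing the `Parameters:` lines that follow, which is not shown here.

```python
import re

# Matches the default Rails "Started ..." line, for write verbs only.
STARTED = re.compile(r'Started (POST|PUT|PATCH) "([^"]+)" for (\S+) at (.+)$')


def write_requests(log_lines):
    """Return (method, path, timestamp) for every POST/PUT/PATCH request line found."""
    found = []
    for line in log_lines:
        match = STARTED.search(line)
        if match:
            method, path, _client, timestamp = match.groups()
            found.append((method, path, timestamp.strip()))
    return found
```

Replaying the requests in timestamp order is what lets you keep track of which new ids map to which old ones.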
> Our backups to S3 apparently don’t work either: the bucket is empty
followed by
> So in other words, out of 5 backup/replication techniques deployed none are working reliably or set up in the first place.
is no way to be running a public service with customer data. Did the person who set up that S3 job simply write a script or something and just go "yep, it's done" and walk away? Seriously?
Applicants for this position can expect the hiring process to follow the order below. Please keep in mind that applicants can be declined from the position at any stage of the process. To learn more about someone who may be conducting the interview, find her/his job title on our team page.
Qualified applicants receive a short questionnaire and coding exercise from our Global Recruiters
The review process for this role can take a little longer than usual but if in doubt, check in with the Global recruiter at any point.
Selected candidates will be invited to schedule a 45min screening call with our Global Recruiters
Next, candidates will be invited to schedule a first 45 minute behavioral interview with the Infrastructure Lead
Candidates will then be invited to schedule a 45 minute technical interview with a Production Engineer
Candidates will be invited to schedule a third interview with our VP of Engineering
Finally, candidates will have a 50 minute interview with our CEO
Successful candidates will subsequently be made an offer via email
I went through all of the insane interviews except the CEO interview for the production engineer position and was rejected. I brought quite a bit of high-level operational experience from a few different places, including some pretty high-traffic video sites, HPC and a lucrative e-commerce site.
The reason for my rejection? I had an ongoing side project and development/consulting gig which had been paying the bills for a few years which they thought I 'misrepresented'. I guess they assumed I was still working there (I was and would continue to do consulting and development on my own time), but the biggest LOL of the whole shitty process was their reasoning: "We are very risk-averse in our hiring." No shit.
I guess they weren't risk-averse enough in their operations though, so I'm glad I didn't get through. It sounds like it would have been an uphill battle all the way to make changes to keep things sane.
And having to crowdsource your move from cloud to your own hardware? Listen to your people, they already had good ideas, all of which were documented in your open issue tracker.
>I guess they weren't risk-averse enough in their operations though, so I'm glad I didn't get through. It sounds like it would have been an uphill battle all the way to make changes to keep things sane.
I think they were risk-averse, which is the problem, except they were judging against the wrong candidate profile. The risk aversion meant they never hired anyone who diverged from their incorrect profile.
>> candidates will be invited to schedule a first 45 minute behavioral interview with the Infrastructure Lead
Yes, go right ahead and filter out some (disclaimer before the rant: some, not all) of the best talent. The kind of potential employee that gets rejected due to perceived personality problems is exactly the kind of person who would tell management to shove a stick up their ass for demanding a 2 week deadline for a project requiring 3 months to execute properly.
Maybe if GitLab had hired the best talent, instead of the best "behavioral/cultural fit", at least one of their 5 backup systems would have been functional. Many people who are perfectionists in their craft, who would never have allowed this kind of failure to take place under their watch, come with abrasive personalities. If you only hire those who are submissive during the interviewing process, you will get exactly what you chose - people with no backbone to push back against unreasonable business expectations.
Case in point: would you want to hire me based on this comment of mine? Hell no! You're going to steer clear of me and give me an instant fail during a "behavioral interview", because you can't look past my belligerence to understand that there is value in having employees who obsess over the little things like having systems that do what the fuck they're supposed to do, rather than being able to give a conformant first impression full of social prowess. "Whoa, he used the word 'fuck' to hammer home his point; definitely avoid hiring this guy!"
tldr; Sometimes, people who are "talented" or "skilled" get to that point by being obsessive freaks who sit at home in the dark all night hacking away at stuff, with no social lives. The result can be someone who knows what they are doing because they invest all their personal free time into the domain, but consequently has absolutely no social skills to put on display.
shorter tldr; Businesses focus on the liability of a person without considering the potential.
Abrasive personalities don't tend to create safe and sane processes; many would just yell at people when mistakes happen. That is not the same thing. Being able to avoid mistakes yourself and being able to create a process with enough safeguards that mistakes are guaranteed not to happen are sometimes in opposition. An obsessive freak who tends not to make small mistakes may resist the change toward a safer process, preferring to play the blame game.
Moreover, many with abrasive personalities are the ones who demand and take for granted excessive overtime the moment they are in a senior or lead position. "I am too tired and it is late at night, let's do it tomorrow" is not an option for an abrasive personality.
As a rule, the obsessive freak with no social skills you describe won't be able to effectively reorganize the team, create a process, or (in the extreme version) work within a process someone else created. He is more likely to end up in endless quarrels about petty differences in coding style.
Ability to create, fight for and enforce sane process is not the same as being abrasive.
He is, however, absolutely fine in an isolated position.
As others have said, abrasiveness isn't a virtue, especially in a role that requires others to change their behavior. There are in fact diplomatic ways to convey the same information which have a much higher chance of causing others to listen and getting everyone to adopt new processes. Being diplomatic and knowing and caring about your stuff are not mutually exclusive.
That said, depending on their interview process, they may be selecting out candidates who even diplomatically disagree about the existing ops processes, which results in less-than-stellar hires.
Not vetting people based on behavior is perilously close to accepting the old adage "Say what you like about Mussolini, but at least he made the trains run on time".
5 interviews isn't -that- insane, it's on the high end though.
But I do know for sure they have turned down extremely technically gifted people for "non-technical" reasons, the kind of people that would ensure this sort of disaster wouldn't happen.
I'm biased because I went through the process and was rejected, but I agree. There was more focus on asinine 'behavioral' questions than the real reason you're talking to someone: are you (relatively) sane, and more importantly are you capable, can you do the job? Everything else is secondary, in my opinion.
I wondered why they always have so many job reqs out, and why they're out for so long. It always seemed like an interesting organization to me, and I like the product, but everything about their hiring process seems to be shooing people away. Perhaps this will be incentive to try onboarding people or at least to bring in some consultants to shore them up.
I think part of the problem, as has already been touched on here, is that true Ops people are of a different breed to standard developers. They usually have similar coding ability (but different areas of expertise: automation and systems coding over business logic) but drastically different personalities.
This is an issue during hiring when companies have minimal Ops background. They (usually) attempt to interview the candidate as they would for a normal dev position. This has 2 serious problems. 1) Real Ops candidates are going to seem lacklustre compared to web devs when given web dev coding tasks and interview questions. 2) You probably aren't going to ask them anything actually relevant to their ability to do Ops well because you just don't know.
This can usually be solved by hiring an experienced SRE manager. However those people are expensive and hard to find.
The next best thing is to have a gun SRE in your personal network that you can convince to join your team and help you build it out.
If you can't achieve that for whatever reason then you really just have to hire based on experience and suck up the cost of hiring an ex-Google/Facebook/Netflix/Twitter SRE that you can be reasonably confident will be good.
That is still hard though, because those sorts of people know there is a ton of shit shoveling to be done if you are the first real SRE on board. You have to clean up the mess left in the wake of "devs doing ops", which is not Devops btw. Devops is what is done by SREs. "devs doing ops" is how this sort of thing happens.
I couldn't agree more. (My followup rant is general, and I have no knowledge of what's going on at GL.)
> Real Ops candidates are going to seem lacklustre compared to web devs when given web dev coding tasks and interview questions.
This drives me insane. I think 50% of my comment history this year is about how not to botch an interview.
> You probably aren't going to ask them anything actually relevant to their ability to do Ops well because you just don't know.
Yes. You won't be able to quiz that person. You can take a thoughtful look at experience and talk about that, though.
> hire based on experience
Yup. One of the spook factors at any company is when I see it only hiring down, i.e., lower than the aggregate skill level of its current team. If the VP of engineering knows more about each concentration area than the people he's hiring, there's a storm a-brewin'.
It's worse: the technical interview for "some" positions apparently consists of pair-programming to resolve a Community Edition Issue which actually will (presumably, if it resolves the issue) become part of the product. No mention if that's paid time or not. That's shameful. I was considering using them, but if they're getting what is effectively free labor as part of their interview process my interest will wane considerably.
> Did the person who set up that S3 job simply write a
> script or something and just go "yep, it's done" and
> walk away?
I don't know of course, but one thing that has to be explicitly and continually monitored is that the existing backup process is still working. We had a backup process at Blekko which stopped working once when an S3 credential that appeared unrelated was removed. As I recall, it was a Nagios test that detected that the next set of backups were too small and got that fixed.
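For anyone wondering what a "backups are too small" check might look like, here is a minimal sketch in Python. It is not the Blekko check; the function name `check_backup_size` and both thresholds are made up for illustration. The only borrowed convention is the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
from statistics import median

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_backup_size(sizes, warn_ratio=0.8, crit_ratio=0.5):
    """Compare the newest backup against the median of the earlier ones.

    sizes: backup sizes in bytes, oldest first; the last entry is the newest.
    warn_ratio / crit_ratio: hypothetical thresholds, tune for your data.
    Returns (exit_code, message).
    """
    if len(sizes) < 2:
        return UNKNOWN, "not enough backup history to compare"
    *history, latest = sizes
    baseline = median(history)
    if baseline <= 0:
        return UNKNOWN, "earlier backups look empty, no usable baseline"
    ratio = latest / baseline
    if ratio < crit_ratio:
        return CRITICAL, f"latest backup is {ratio:.0%} of the median size"
    if ratio < warn_ratio:
        return WARNING, f"latest backup is {ratio:.0%} of the median size"
    return OK, f"latest backup is {ratio:.0%} of the median size"
```

A wrapper script would gather the sizes from wherever the backups live, print the message, and exit with the returned code. A size check only catches the "suddenly tiny or empty" failure; it says nothing about whether the backup can actually be restored.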
Yeah, the only solution is to test your backups - either automatically or manually through failure.
My small business uses a global network of cheap servers to provide low latency to customers, so we have encountered significant lengthy failures of disk or network once every few months, but hey, we can be damn sure our backups and deploy scripts work because we're forced to restore them all too often.
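A cheap automatic version of "test your backups" is to read every file back out of the archive and compare it with checksums recorded at backup time. Rough sketch below, assuming the backup is a tar archive and that you kept a manifest of SHA-256 digests; `verify_backup` and the manifest format are hypothetical, not anything GitLab uses. It is a weaker test than a full restore into a real database, but it does catch empty or truncated archives.

```python
import hashlib
import tarfile


def verify_backup(archive_path, expected):
    """Check a tar backup against a manifest of SHA-256 digests.

    expected: dict mapping member name -> sha256 hex digest (hypothetical
    manifest written when the backup was taken).
    Returns a sorted list of problems; an empty list means everything matched.
    """
    problems = []
    seen = set()
    with tarfile.open(archive_path) as tar:
        for member in tar:
            if not member.isfile() or member.name not in expected:
                continue
            seen.add(member.name)
            digest = hashlib.sha256()
            stream = tar.extractfile(member)
            # Read in chunks so large dumps do not have to fit in memory.
            for chunk in iter(lambda: stream.read(1 << 20), b""):
                digest.update(chunk)
            if digest.hexdigest() != expected[member.name]:
                problems.append(f"checksum mismatch: {member.name}")
    for name in expected.keys() - seen:
        problems.append(f"missing from archive: {name}")
    return sorted(problems)
```

Run it from cron after every backup and alert on a non-empty result.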
Working on a Gitlab project right now, just noticed the site was down, thanks to the team for working so hard to fix/rectify this mistake and being totally open about it.
Appreciate the openness and utility of Gitlab (as I've said in other threads). I'm sure it's frustrating to have this happen, but hang in there! Services generally have 99.9% uptime anyway :)
And this is why I still use GitHub. It's a shame as GitLab was looking like a nice alternative (for self hosting, might still try it some time), but if this sort of thing could even possibly happen.....
I didn't read too much into this, but they really didn't have any backup of the databases?
I once rm-ed my home directory while writing and testing a script, but it turned out that directories like .m2 and .ivy2 are huge and are the first ones to be deleted by 'rm -rf' by default. So they gave me some buffer time to figure out that something was wrong.
That's why I do not use github/gitlab/whatever to host the part of my code that is too critical to me. I push it to my ssh/git server and use local UI to interact with it instead.
Sometimes source code is very valuable and you just can not make any mistakes with it.
There's really no point doing intensive cognitive work for more than 8 hours straight. After that you go by instinct and muscle memory. Surprisingly, a lot of tasks can still be done, but you shouldn't do anything critical.
While this is sad to see it is a lesson to us all and I've shared it with coworkers who haven't taken disaster recovery as seriously as we should on our projects. Hopefully this will help raise this as a priority.
I wonder if they have some kind of protocol for modifying production environments that was somehow overridden: one person creates the bash scripts and a second one reviews and executes them, never a single person.
I agree with this. Prepare for destroyed (burnt-down level) machine, for datacentre failure, for stolen home server, for scratched blu-ray archives - in short, for the worst.
And of course, hope for the best.
It's easier said than done in some companies where stakeholders are always pushing for new stuff.
Unless those who fund what you're doing understand why disaster recovery is vital, you're going to see this.
Ideally you want devops in such a state that you create new lower environments that mirror production, complete with state/backup restoration, and that this runs automatically every week.
rm is a part of coreutils, right? Why not just substitute it with a less destructive script moving files to /mnt/oops/<origpath-date>/<filename> in the next release? Badasses can echo badass > /etc/rm.conf to get the original behavior.
Admins have had to reinvent that bicycle for decades. Stop it now, please!
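To make the idea concrete, here is a sketch of such a "soft rm" in Python. Nothing like this ships in coreutils; `soft_delete` is a made-up name, and the `/mnt/oops/<origpath-date>/<filename>` layout is just my reading of the path in the comment above.

```python
import shutil
from datetime import date
from pathlib import Path


def soft_delete(target, trash_root="/mnt/oops", today=None):
    """Move a file or directory into a dated trash area instead of deleting it.

    Example layout (hypothetical):
      /home/me/project/file.txt
        -> <trash_root>/home/me/project-2017-02-01/file.txt
    Returns the new location.
    """
    target = Path(target).absolute()
    stamp = (today or date.today()).isoformat()
    # Strip the filesystem root so the original path nests under trash_root.
    rel_parent = target.parent.relative_to(target.parent.anchor)
    dest_dir = Path(trash_root) / f"{rel_parent}-{stamp}"
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / target.name
    if dest.exists():
        # Refuse to clobber something trashed earlier the same day.
        raise FileExistsError(f"already in trash: {dest}")
    shutil.move(str(target), str(dest))
    return dest
```

It ignores things a real replacement would need to handle: free space on the trash volume, cross-filesystem moves of huge trees, and expiring old entries.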
I only hope you prepare for DDoS attacks of this kind. Github has been fighting this stuff for the last several years... maybe you have just been lucky so far. Please prepare!
So tl;dr: Gitlab was experiencing heavy DoS attacks that created so much data that replication stopped working. In the process of getting replication to work again, "YP" wanted to delete the empty data directory of the slave DB server, but accidentally deleted it on the master DB server. Out of the 5 backup/replication techniques they use, not one is working reliably. The only backup they could use was one YP had created manually, by chance, 6 hours earlier.
Tell me if I misunderstood something. I hope the customer I met last week does not remember I ever recommended GitLab to him.
The major Azure problems about two years ago were documented by GitLab in a similar manner. I find the openness a good thing, even in not so good times. Thumbs up.
What I'm going to criticize is the excess of transparency:
You absolutely DO NOT publish postmortems referencing actions by NAMED individuals, EVER.
From reading the whole report it's clear that the group is at fault, not a single individual. But most people won't read the whole thing, even less people will try to understand the whole picture. That's why failures are always attributed publicly to the whole team, and actions by individuals are handled internally only.
And they're making it even worse by livestreaming the thing! It's like having your boss looking over your shoulder but a million times worse...