First, my apologies for the outage. I consider our package infrastructure to be critical infrastructure, for both the free and commercial versions of Docker. It's true that we offer better support for the commercial version (it's one of its features), but that should not apply to fundamental things like being able to download your packages.
The team is working on the issue and will continue to give updates here. We are taking this seriously.
Some of you pointed out that the response time and use of communication channels seem inadequate; for example, the @dockerstatus bot did not mention the issue when it was detected. I share that opinion but I don't know the full story yet; the post-mortem will tell us for sure what went wrong. At the moment the team is focusing on fixing the issue and I don't want to distract them from that.
Once the post-mortem identifies what went wrong, we will take appropriate corrective action. I suspect part of it will be better coordination between core engineers and infrastructure engineers (2 distinct groups within Docker).
> At the moment the team is focusing on fixing the issue
> and I don't want to distract them from that.
That might be ok for feature teams, but for infrastructure tools/services, it's very frustrating for users (devs) to be kept in the dark on the progress of the fix.
At work, incident response starts with identifying Investigators (to find and fix the problem) and a Communicator (to update channel topics, send the outage email and periodic updates, field first-line questions about the incident, and contact those most affected so they aren't surprised and don't try to fix it themselves). The person who opens the incident is the Coordinator, who assigns the roles, escalates if more help is needed, tries to unblock investigations, and turns facts from the Investigators into status updates for the Communicator.
I will provide an opposing viewpoint which I'm sure many people do not agree with.
If a service I use is down, all I want is an acknowledgement that "we are working on it right now with high priority". I want all available resources to be fixing the problem.
My anxiety over powerlessness in relying on others during a crisis manifests in other ways, like figuring out why I'm at the mercy of this thing in the first place, and putting alternatives in place.
But during the crisis I'll just go do something else for an hour and then read the post-mortem when it comes out.
Updates that are communicated are not the details of what's happening in the investigation, but things that are expected to be useful to users/clients. They are things such as the estimated time for problem resolution, updates on the scope of the problem (e.g. "this is an instrumentation problem" vs "this is an actual outage") or mitigation steps that could be applied by the clients. It is often very useful to know such things during an outage.
I don't think anyone, anywhere, would have given you an accurate estimate of nearly 5 hours to fix this problem.
Furthermore, even if you somehow could divine the future, telling the customer that you think an outage will last over half a working day is going to turn an extremely shitty situation into something even worse.
Sometimes the resources available are suited to different problems.
I consider high quality communicators to be a significant resource, and the problem to solve is to make sure everyone affected has the right information to make their next decision to mitigate the issue.
These communicators might not know how to solve the actual technical problem, and telling them to go "all hands on fixing a configuration issue with the web server", for example, would be like putting monkeys at a typewriter.
I'd say apply the right people to the right problems.
I agree with you and we have a similar process at Docker. Part of what went wrong in this particular case is precisely that our infrastructure incident response process did not kick in. I am guessing that one conclusion of the post-mortem will be "make sure to handle open-source package distribution infrastructure like the rest of our infrastructure, including incident response checklist".
I think this is fine.
If you use a 3rd-party repository, you can almost expect it to fail. If you rely on repositories that heavily, run Apt Cacher or mirror them on S3 or the like; it's pretty simple.
Keeping all your dependencies - binaries, debs, etc. - mirrored locally is pretty common practice in any professional environment...
Can't wait for dockerhub to go down now... :troll:
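To make the "mirror it on S3" suggestion above a bit more concrete, here is a minimal sketch that syncs a locally built apt mirror into an S3 bucket. It assumes boto3 is installed; the mirror path and bucket name are placeholders, not anything from the thread.

    import os
    import boto3

    MIRROR_ROOT = "/var/spool/apt-mirror/mirror"  # wherever your mirror tool writes (placeholder)
    BUCKET = "my-internal-apt-mirror"             # hypothetical bucket name

    s3 = boto3.client("s3")

    # Walk the local mirror and upload every file, keeping the same relative layout.
    for dirpath, _dirnames, filenames in os.walk(MIRROR_ROOT):
        for name in filenames:
            local_path = os.path.join(dirpath, name)
            key = os.path.relpath(local_path, MIRROR_ROOT)
            s3.upload_file(local_path, BUCKET, key)
            print(f"uploaded {key}")

Pointing your sources.list at the bucket (via a static-website endpoint or an internal proxy) then keeps builds independent of the upstream repo being up.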
This isn't doing devops, this is release management, which is something that traditional sysadmins do all the time. Making sure that the image file and the repository information match up, and that a release is deployed correctly, is pretty basic. For a project like this, I'm surprised they don't have tools like Nagios constantly checking that downloads are working and that checksums etc. all match up, preferably on the servers before whatever load-balancing system you have points at them. Deployment can, and should, be as atomic as possible, regardless of who is pushing the go button.
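As a rough illustration of the kind of check described above, here is a minimal monitoring sketch: it downloads a repository's Packages index, picks the first entry, fetches the referenced .deb, and verifies its SHA256 against what the index declares. The repository URL is a placeholder, and it assumes an uncompressed Packages file is served.

    import hashlib
    import urllib.request

    ARCHIVE = "https://apt.example.internal"                      # placeholder archive root
    INDEX = f"{ARCHIVE}/dists/stable/main/binary-amd64/Packages"  # assumes an uncompressed index

    def fetch(url: str) -> bytes:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.read()

    packages = fetch(INDEX).decode("utf-8", errors="replace")

    # Pull the Filename and SHA256 fields out of the first stanza.
    entry = {}
    for line in packages.splitlines():
        if not line.strip():
            break  # blank line ends the first stanza
        if ": " in line:
            key, value = line.split(": ", 1)
            entry[key] = value

    deb = fetch(f"{ARCHIVE}/{entry['Filename']}")
    actual = hashlib.sha256(deb).hexdigest()

    if actual != entry["SHA256"]:
        raise SystemExit(f"checksum mismatch for {entry['Filename']} -- page someone")
    print("repository looks healthy")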
Docker's messaging is specifically anti-devops. They market containers as a "separation of concerns" between dev and ops, where dev is responsible for everything that's in the container, and ops is responsible for deploying the black box.
It's true that Docker helps separate dev concerns from ops concerns. But it doesn't prevent dev and ops teams from collaborating, or the same team from wearing both hats - the most common aspects of a "devops" methodology in my experience.
In fact, separation of concerns makes collaboration more efficient, because everybody knows who is responsible for what. So you could argue that Docker actually facilitates "doing devops", if that's the methodology you choose.
You make a good point that "doing devops" is a methodology that you can choose to use or not. There is no moral hazard in not using it.
That said, having ops excluded from the architecture decisions that go into building containers is absolutely antithetical to "doing devops."
As for your assertion that people know who is responsible for what in your model, I argue that devs are usually not thinking ahead that they're going to be the ones on-call to fix whatever breaks in production in the middle of the night, because they're the only ones who know what's inside the container. Ops can't be responsible for fixing whatever is inside an artifact that it had no role in creating.
We're going to have the inverse of the little-girl-smiling-at-the-house-fire meme. "Kubernetes is Running Fine, Dev Problem Now."
> That said, having ops excluded from the architecture decisions that go into building containers is absolutely antithetical to "doing devops".
My point precisely. Just because there is clean separation of concerns doesn't mean anyone needs to be excluded.
The methodology I've seen work best is one where people are not divided by skillset (dev/ops) but instead by area of responsibility (app/infrastructure). Then you embed people from different functional areas into app teams: devs of course, but also a security engineer, an operations specialist, and various domain experts. From the point of view of IT, you're influencing the app's development from the start to make sure it follows best practices.
A second important point is that just because you're running a container built by someone else doesn't mean you can't enforce good operations practices. For example, you can mandate that all the containers on your production swarm expose health information at a specific URL prefix and pass CVE scanning - or they will not be deployed.
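A minimal sketch of that health-gate convention, assuming the services expose a plain HTTP /health endpoint; the hostnames, port, and URL prefix are made-up conventions, not anything Docker or Swarm provides out of the box.

    import sys
    import urllib.request

    # Hypothetical service addresses; in practice you would enumerate them from
    # your orchestrator or service discovery.
    ENDPOINTS = [
        "http://app-1.internal:8080",
        "http://app-2.internal:8080",
    ]

    def healthy(base_url: str) -> bool:
        """Return True if the agreed-upon /health endpoint answers with HTTP 200."""
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                return resp.status == 200
        except OSError:
            return False

    failed = [url for url in ENDPOINTS if not healthy(url)]
    if failed:
        sys.exit(f"refusing to deploy, unhealthy services: {failed}")
    print("all services healthy")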
DevOps isn't (just) Dev doing Ops and Ops doing Dev, though. It's about understanding each team's domain and facilitating communication. Nothing about Docker limits that, per se.
e.g.:
> I argue that devs are usually not thinking ahead that they're going to be the ones on-call to fix whatever breaks in production in the middle of the night
That's not a Docker issue. That's just a DevOps culture that is incomplete.
Maybe you are being downvoted because your comment is too short, but it's something that crossed my mind when reading the end of the parent comment.
The best thing is, this might end up being the best proof of why you need to embrace devops methodologies and maybe take advantage of tools like Docker while doing so :)
From the GitHub issue thread, I see a lot of people being angry about their production deployments failing. If you point directly at an external repo in your production deployments, you'd better not be surprised when it goes down. Because shit always happens.
If you want your deployments to be independent of the outside world, design them that way!
Same thought here; uncached (re)installs feel like bad practice. It's bad disaster planning, but it also makes your software deployments prone to inconsistencies.
Seriously. My favorites were the complainers sagely tutting about Docker's "single points of failure" while in the same breath complaining about the pain Docker's outage caused their own infrastructure.
Yep, even for small personal projects I deploy Docker apps by saving the image and transferring it to the production server from my laptop. It's so easy to do that I just can't see any justification for not doing it, especially in any kind of commercial deployment.
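For anyone curious what that workflow looks like, here is a minimal sketch wrapping the standard docker save / scp / docker load commands; the image tag and server hostname are placeholders.

    import subprocess

    IMAGE = "myapp:1.2.3"               # hypothetical image tag
    SERVER = "deploy@prod.example.com"  # hypothetical production host
    TARBALL = "myapp.tar"

    # Export the image locally, copy it over, and load it on the server.
    subprocess.run(["docker", "save", "-o", TARBALL, IMAGE], check=True)
    subprocess.run(["scp", TARBALL, f"{SERVER}:/tmp/{TARBALL}"], check=True)
    subprocess.run(["ssh", SERVER, "docker", "load", "-i", f"/tmp/{TARBALL}"], check=True)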
Maybe this is the norm in big enterprises, but I have not actually come across any company which hosts a local package repository for commonly available packages.
In big enterprises, core production infrastructure frequently has NO access to internet, ever.
Most places I've ever worked in will have local repositories, procedures and timelines for anything from Microsoft and OS updates, to development stacks and libraries.
Neither workstations nor servers get updated directly from external/vendor/open repositories - it is all managed in-house.
Slower, yes; more work, yes; but that's exactly the type of issue it's meant to prevent :)
Really? We certainly do here. Linux packages are mirrored. Programming dependencies are mirrored via tools like Nexus or Artifactory.
Like tsuresh said - stuff happens. What if your internet connection went down for a long period of time? You couldn't continue working. It takes very little to set up, gives you failover, and also makes installing dependencies sooo much faster.
You don't need a seamless, robust process for dealing with the occasional remote failure (especially when there are mirrors), but you can for example save snapshots of dependencies.
You should be able to do something in an emergency, even if it requires manual intervention. If you can only shrug and wait, that's bad.
At work, we've got our prereqs stored in a combination of Artifactory and Perforce. Even for my own personal projects, I'll fetch packages from the Internet when I'm setting up my dev environment, but actually doing a build only uses packages that I've got locally.
It's a little mind-boggling to me that anyone would rely on the constant availability of a free external service that's integral to putting their product together. I handle timestamps for codesigning through free public sites, but I've also got a list of 8 different timestamp servers to use, and switch between them if there's a failure.
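The failover pattern is simple enough to sketch: keep a list of equivalent servers and walk down it until one answers. The URLs below are placeholders; a real codesigning setup would hand the chosen server to the signing tool rather than just probing it.

    import urllib.error
    import urllib.request

    # Hypothetical list; substitute the timestamp services you actually use.
    TIMESTAMP_SERVERS = [
        "http://timestamp.example-one.com",
        "http://timestamp.example-two.com",
        "http://timestamp.example-three.com",
    ]

    def first_reachable(urls, timeout=10):
        """Return the first server that answers at all; raise if none do."""
        for url in urls:
            try:
                urllib.request.urlopen(url, timeout=timeout)
                return url
            except urllib.error.HTTPError:
                return url   # the server answered, even if with an error status
            except OSError:
                continue     # connection-level failure: try the next one
        raise RuntimeError("no timestamp server reachable; fail the build loudly")

    print("using", first_reachable(TIMESTAMP_SERVERS))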
You don't see well-written, solidly thought-out code being the norm either, for pretty much the same reason. It takes experience and guided thought to get to this point, and seasoned sysadmins (who have this worked out) aren't exactly the crowd considered to be sexy nowadays.
Infrastructure that works has very little drama. Without drama, you're not in anyone's sphere of attention, and being outside that, there's no reason to be sexy.
I agree. I'm a sysadmin myself, and the best congratulations I've ever had was after a heavy rewrite of a firewall script (the old one was a mix of three different styles, an unmanageable mess after several years of work), when my colleague asked me when I was going to deploy the new firewall - two or three weeks after it had actually gone into production. It was so smooth that nobody noticed.
Everywhere I've worked, mostly very small companies for the past decade, has kept local mirrors of critical infrastructure we depend on, specifically distro repositories, CPAN, etc. It's just a sane best practice. It really doesn't take all that much work to run apt-mirror or reprepro.
Devops here for a startup: We run our own repo for critical packages specifically to ensure AMI baking and deploys do not break when a package isn't available for whatever reason (and we pin to the package version).
After reading the bug report I am surprised that so many people are using remote package repositories for their build machines' builds... then again, I'm not too surprised I guess.
I'm not that familiar with Docker, but I am with package/dependency management (debs, jars, npm, eggs, etc.), and you most certainly want to use a mirrored package repository (JFrog, Sonatype, or whatever) for this reason and many others (bandwidth, security, control, etc.).
So if you did have issues with the outage I would look into one of those mirroring tools. At the minimum it will speed up your builds.
I wonder how this all works if you have really small, non-critical systems and you are using SaaS and PaaS infrastructure (hello cloud computing, not doing stuff in house) like Travis CI, where you are not in control of their repositories. These kinds of services (like SaaS CI) are not new, but they make life a lot easier than running your own CI (looking at you, Jenkins). Not everybody wants to replicate whole repos for everything. Also, some people (like me) want to distribute open source software systems (which will not install themselves on the client side if one component on the Internet fails), not mission-critical single-point services.
There's truth here, except that running your own repos to get immutability shouldn't be something you have to do for infrastructure. Aptitude, and package managers of its ilk, need to die.
You don't have to run your own repos for immutability. You just have to use repositories that say up front what will - or will not - change, and then don't do anything on top of them that expects something different from what that particular repo provides.
The question is why anyone would expect immutability after pointing their package tools at a mutable repository.
Probably broken, since a competent attacker would have been able to avoid creating a checksum mismatch.
My company has actually done the same thing before (same error) by putting CloudFront in front of our APT repo -- it cached the main Packages file inappropriately, causing the checksum mismatch.
Yeah... the entire point of an on-call rotation is to specify who is available for incident response... if everybody is on the same plane at once then by definition nobody is on-call.
There was a comment a bit below suggesting that those who paid for commercial support got it 24/7, but if that's true, I'd imagine the fix that commercial support would have given to paying customers would have fixed it for everyone else too...
I believe commercial releases are downloaded from a separate infrastructure (to be confirmed).
Either way, the availability of Docker packages, free or commercial, is critical infrastructure and we should treat it as such. IMO our primary infrastructure team should have been involved, and someone should be on call for this. We'll do a post-mortem, find the root cause, and take corrective action as needed.
Part of a healthy community is accepting that people disagree a lot, have different values, and communicate their ideas in very different ways.
Where we draw the line is if people are being intimidated, bullied, insulted, or anything that even remotely resembles harassment.
Although I personally feel that some of the comments in that thread are pretty unfair and poorly informed, they don't seem to violate the social contract.
The commercial releases are different from the open source ones (they have different patches and are built as separate releases), and they follow a different release schedule.
Captain Blackadder: Baldrick, what are you doing out there?
Private Baldrick: I'm carving something on a bullet, sir.
Captain Blackadder: What are you carving?
Private Baldrick: I'm carving "Baldrick", sir.
Captain Blackadder: Why?
Private Baldrick: It's part of a cunning plan, sir.
Captain Blackadder: Of course it is.
Private Baldrick: You know how they say that somewhere there's a bullet with your name on it?
Captain Blackadder: Yes?
Private Baldrick: Well I thought that if I owned the bullet with my name on it, I'll never get hit by it. Cause I'll never shoot myself...
Captain Blackadder: Oh, shame!
Private Baldrick: And the chances of there being two bullets with my name on it are very small indeed.
Captain Blackadder: Yes, it's not the only thing that is "very small indeed". Your brain, for example. Your brain's so minute, Baldrick, that if a hungry cannibal cracked your head open, there wouldn't be enough to cover a small water biscuit.
Assume N is the total number of buses. Where N=1 your solution works. For N>1 the engineers can still be hit by a bus. This also ignores all other vehicles such as garbage trucks, semis and stealth bombers, all of which can take out your bus.
I don't think this would surprise anyone that has used Docker Hub in their CD pipeline. So many reliability issues.
We moved off it last year. When we went to cancel our subscription the other month, downgrades/cancellations were broken on the site as a known issue; we had to open a support ticket. Most of the UI issues were still present, along with some new ones.
It's scary how most people in that thread seem more concerned with forcing the installation through than with pausing to consider why the hashes might be wrong, and why it might not be a good idea to install debs with incorrect hashes.
If the apt repo was compromised (but the signing keys were not), this is very likely exactly the symptom that would appear.
> If the apt repo was compromised (but the signing keys were not), this is very likely exactly the symptom that would appear.
I don't think that's correct. It would pass a checksum test and fail a signature test with a "W: GPG Error". The checksum test is not about cryptographic security, it's just about files referenced by the Packages file having the same hash that the Packages file declares them to have. You don't need any signing keys to make that happen.
What's more suspicious: Bad hashes or bad signatures? What would an attacker choose if their goal was to get as many people as possible to force install?
It's impossible to force install the packages when they have bad hashes (hence the severe breakage here), and it is possible to install the packages when they have bad signatures if you didn't import the gpg key or don't run with signature checking.
So I'd guess a rational attacker would choose a bad signature. But attackers can be irrational; it doesn't prove it's not an attack. Just not my intuition.
This is a really bad title. There is nothing wrong with either Ubuntu's or Debian's repositories. The problem is with Docker's repositories of Ubuntu/Debian packages.
I'm a bit disappointed that people are willing to make public criticisms of Docker when it's their builds that are failing. They made the decision to depend on a resource that could be unavailable for a large number of reasons entirely unrelated to Docker or their infrastructure.
Just like the Node builds that failed, this should cause you to rethink how you mirror or cache remote resources, not prompt you to complain about your broken builds on a GitHub issue page. There may be things you'll never be able to fully mirror or cache (or it could just be entirely impractical), but an apt repository is definitely not one of them.
... which is why the clever sysop mirrors his packages and tests if an update goes OK before updating the mirror.
If you're running more than three machines or regularly (re)deploy VMs, it is a sign of civilization to use your own mirror instead of putting your load on (often) donated resources.
It's the same stupid attitude of "hey, let's outsource dependency hosting" that led to the leftpad NPM disaster and will lead to countless more such disasters in the future.
People, mirror your dependencies locally, archive their old versions, and always test what happens if the outside Internet breaks down. If your software fails to build when the NOC's uplink goes down, you've screwed up.
I often wonder why the community's response to issues with an open/free/community package is to give the maintainers a strong argument to discontinue it in favour of a commercial one, or just abandon it altogether.
I think this is a chain of dependencies, especially when you use Travis CI: 1) apt-get isn't flexible enough to ignore that error on apt-get update; 2) Travis CI has so much external stuff installed that it's a big, big image with more failure points; 3) the Docker repo failed.
Outages or mis-configurations can happen to pretty much any source of packages you use, be it debian, pypi, npm, bower or maven repositories, or source control. Anybody remember left-pad?
So as soon as you depend heavily on external sources, you should start to think about maintaining your own mirror. Software like Pulp and Nexus is pretty versatile and gives you a good amount of control over your upstream sources.
I am copying it below:
<<< Hi everyone. I work at Docker.
First, my apologies for the outage. I consider our package infrastructure to be critical infrastructure, for both the free and commercial versions of Docker. It's true that we offer better support for the commercial version (it's one of its features), but that should not apply to fundamental things like being able to download your packages.
The team is working on the issue and will continue to give updates here. We are taking this seriously.
Some of you pointed out that the response time and use of communication channels seem inadequate; for example, the @dockerstatus bot did not mention the issue when it was detected. I share that opinion but I don't know the full story yet; the post-mortem will tell us for sure what went wrong. At the moment the team is focusing on fixing the issue and I don't want to distract them from that.
Once the post-mortem identifies what went wrong, we will take appropriate corrective action. I suspect part of it will be better coordination between core engineers and infrastructure engineers (2 distinct groups within Docker).
Thanks and sorry again for the inconvenience. >>>