First, my apologies for the outage. I consider our package infrastructure to be critical infrastructure, for both the free and commercial versions of Docker. It's true that we offer better support for the commercial version (it's one of its features), but that should not apply to fundamental things like being able to download your packages.
The team is working on the issue and will continue to give updates here. We are taking this seriously.
Some of you pointed out that the response time and use of communication channels seem inadequate; for example, the @dockerstatus bot did not mention the issue when it was detected. I share that opinion but I don't know the full story yet; the post-mortem will tell us for sure what went wrong. At the moment the team is focusing on fixing the issue and I don't want to distract them from that.
Once the post-mortem identifies what went wrong, we will take appropriate corrective action. I suspect part of it will be better coordination between core engineers and infrastructure engineers (2 distinct groups within Docker).
> At the moment the team is focusing on fixing the issue
> and I don't want to distract them from that.
That might be ok for feature teams, but for infrastructure tools/services, it's very frustrating for users (devs) to be kept in the dark on the progress of the fix.
At work, incident response starts with identifying Investigators (to find and fix the problem) and a Communicator (to update channel topics, send the outage email and periodic updates, field first-line questions about the incident, and contact those most affected so they aren't surprised and don't try to fix it themselves). The person who opens the incident is the Coordinator, who assigns the roles, escalates if more help is needed, tries to unblock investigations, and turns facts from the Investigators into status updates for the Communicator.
I will provide an opposing viewpoint which I'm sure many people do not agree with.
If a service I use is down, all I want is an acknowledgement that "we are working on it right now with high priority". I want all available resources to be fixing the problem.
My anxiety over powerlessness in relying on others during a crisis manifests in other ways, like figuring out why I'm at the mercy of this thing in the first place, and putting alternatives in place.
But during the crisis I'll just go do something else for an hour and then read the post-mortem when it comes out.
Updates that are communicated are not the details of what's happening in the investigation, but things that are expected to be useful to users/clients. They are things such as the estimated time for problem resolution, updates on the scope of the problem (e.g. "this is an instrumentation problem" vs "this is an actual outage") or mitigation steps that could be applied by the clients. It is often very useful to know such things during an outage.
I don't think anyone, anywhere, would have given you an accurate estimate of nearly 5 hours to fix this problem.
Furthermore, even if you somehow could divine the future, telling the customer that you think an outage will last over half a working day is going to turn an extremely shitty situation into something even worse.
Sometimes the resources available are suited to different problems.
I consider high quality communicators to be a significant resource, and the problem to solve is to make sure everyone affected has the right information to make their next decision to mitigate the issue.
These communicators might not know how to solve the actual technical problem, and telling them to go "all hands on fixing a configuration issue with the web server", for example, would be like putting monkeys at a typewriter.
I'd say apply the right people to the right problems.
I agree with you and we have a similar process at Docker. Part of what went wrong in this particular case is precisely that our infrastructure incident response process did not kick in. I am guessing that one conclusion of the post-mortem will be "make sure to handle open-source package distribution infrastructure like the rest of our infrastructure, including incident response checklist".
I think this is fine.
If you use a 3rd-party repository, you can almost expect it to fail. If you rely on repositories that heavily, run Apt Cacher or mirror them on S3 or the like; it's pretty simple.
Keeping all your dependencies - binaries, debs, etc. - mirrored locally is pretty common practice in any professional environment...
Can't wait for dockerhub to go down now... :troll:
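To make the "mirror it on S3" suggestion above a bit more concrete, here is a minimal sketch that syncs a locally built apt mirror into an S3 bucket. It assumes boto3 is installed; the mirror path and bucket name are placeholders, not anything from the thread.

    import os
    import boto3

    MIRROR_ROOT = "/var/spool/apt-mirror/mirror"  # wherever your mirror tool writes (placeholder)
    BUCKET = "my-internal-apt-mirror"             # hypothetical bucket name

    s3 = boto3.client("s3")

    # Walk the local mirror and upload every file, keeping the same relative layout.
    for dirpath, _dirnames, filenames in os.walk(MIRROR_ROOT):
        for name in filenames:
            local_path = os.path.join(dirpath, name)
            key = os.path.relpath(local_path, MIRROR_ROOT)
            s3.upload_file(local_path, BUCKET, key)
            print(f"uploaded {key}")

Pointing your sources.list at the bucket (via a static-website endpoint or an internal proxy) then keeps builds independent of the upstream repo being up.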
This isn't doing devops, this is release management, which is something that traditional sysadmins do all the time. Making sure that the image file and the repository information match up, and that a release is deployed correctly, is pretty basic. For a project like this, I'm surprised they don't have tools like Nagios constantly checking that downloads are working and that checksums etc. all match up, preferably on the servers before whatever load-balancing system you have points at them. Deployment can, and should, be as atomic as possible, regardless of who is pushing the go button.
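As a rough illustration of the kind of check described above, here is a minimal monitoring sketch: it downloads a repository's Packages index, picks the first entry, fetches the referenced .deb, and verifies its SHA256 against what the index declares. The repository URL is a placeholder, and it assumes an uncompressed Packages file is served.

    import hashlib
    import urllib.request

    ARCHIVE = "https://apt.example.internal"                      # placeholder archive root
    INDEX = f"{ARCHIVE}/dists/stable/main/binary-amd64/Packages"  # assumes an uncompressed index

    def fetch(url: str) -> bytes:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.read()

    packages = fetch(INDEX).decode("utf-8", errors="replace")

    # Pull the Filename and SHA256 fields out of the first stanza.
    entry = {}
    for line in packages.splitlines():
        if not line.strip():
            break  # blank line ends the first stanza
        if ": " in line:
            key, value = line.split(": ", 1)
            entry[key] = value

    deb = fetch(f"{ARCHIVE}/{entry['Filename']}")
    actual = hashlib.sha256(deb).hexdigest()

    if actual != entry["SHA256"]:
        raise SystemExit(f"checksum mismatch for {entry['Filename']} -- page someone")
    print("repository looks healthy")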
Docker's messaging is specifically anti-devops. They market containers as a "separation of concerns" between dev and ops, where dev is responsible for everything that's in the container, and ops is responsible for deploying the black box.
It's true that Docker helps separate dev concerns from ops concerns. But it doesn't prevent dev and ops teams from collaborating, or the same team from wearing both hats - the most common aspects of a "devops" methodology in my experience.
In fact, separation of concerns makes collaboration more efficient, because everybody knows who is responsible for what. So you could argue that Docker actually facilitates "doing devops", if that's the methodology you choose.
You make a good point that "doing devops" is a methodology that you can choose to use or not. There is no moral hazard in not using it.
That said, having ops excluded from the architecture decisions that go into building containers is absolutely antithetical to "doing devops."
As for your assertion that people know who is responsible for what in your model, I argue that devs are usually not thinking ahead that they're going to be the ones on-call to fix whatever breaks in production in the middle of the night, because they're the only ones who know what's inside the container. Ops can't be responsible for fixing whatever is inside an artifact that it had no role in creating.
We're going to have the inverse of the little-girl-smiling-at-the-house-fire meme. "Kubernetes is Running Fine, Dev Problem Now."
> That said, having ops excluded from the architecture decisions that go into building containers is absolutely antithetical to "doing devops".
My point precisely. Just because there is clean separation of concerns doesn't mean anyone needs to be excluded.
The methodology I've seen work best is one where people are not divided by skillset (dev/ops) but instead by area of responsibility (app/infrastructure). Then you embed people from different functional areas into app teams: devs of course, but also a security engineer, an operations specialist, and various domain experts. From the point of view of IT, you're influencing the app's development from the start to make sure it follows best practices.
A second important point is that just because you're running a container built by someone else doesn't mean you can't enforce good operations practices. For example, you can mandate that all the containers on your production swarm expose health information at a specific URL prefix and pass CVE scanning - or they will not be deployed.
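A minimal sketch of that health-gate convention, assuming the services expose a plain HTTP /health endpoint; the hostnames, port, and URL prefix are made-up conventions, not anything Docker or Swarm provides out of the box.

    import sys
    import urllib.request

    # Hypothetical service addresses; in practice you would enumerate them from
    # your orchestrator or service discovery.
    ENDPOINTS = [
        "http://app-1.internal:8080",
        "http://app-2.internal:8080",
    ]

    def healthy(base_url: str) -> bool:
        """Return True if the agreed-upon /health endpoint answers with HTTP 200."""
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                return resp.status == 200
        except OSError:
            return False

    failed = [url for url in ENDPOINTS if not healthy(url)]
    if failed:
        sys.exit(f"refusing to deploy, unhealthy services: {failed}")
    print("all services healthy")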
DevOps isn't (just) Dev doing Ops and Ops doing Dev, though. It's about understanding each team's domain and facilitating communication. Nothing about Docker limits that, per se.
e.g.:
> I argue that devs are usually not thinking ahead that they're going to be the ones on-call to fix whatever breaks in production in the middle of the night
That's not a Docker issue. That's just a DevOps culture that is incomplete.
Maybe you are being downvoted because your comment is too short, but it's something that crossed my mind when reading the end of the parent comment.
The best thing is, this might end up being the best proof of why you need to embrace devops methodologies and maybe take advantage of tools like Docker while doing so :)
From the GitHub issue thread, I see a lot of people being angry about their production deployments failing. If you point directly at an external repo in your production deployments, you'd better not be surprised when it goes down. Because shit always happens.
If you want your deployments to be independent of the outside world, design them that way!
Same thought here; uncached (re)installs feel like bad practice. It's bad disaster planning, but it also makes your software deployments prone to inconsistencies.
Seriously. My favorites were the complainers sagely tutting about Docker's "single points of failure" while in the same breath complaining about the pain Docker's outage caused their own infrastructure.
Yep, even for small personal projects I deploy Docker apps by saving the image and transferring it to the production server from my laptop. It's so easy to do that I just can't see any justification for not doing it, especially in any kind of commercial deployment.
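For anyone curious what that workflow looks like, here is a minimal sketch wrapping the standard docker save / scp / docker load commands; the image tag and server hostname are placeholders.

    import subprocess

    IMAGE = "myapp:1.2.3"               # hypothetical image tag
    SERVER = "deploy@prod.example.com"  # hypothetical production host
    TARBALL = "myapp.tar"

    # Export the image locally, copy it over, and load it on the server.
    subprocess.run(["docker", "save", "-o", TARBALL, IMAGE], check=True)
    subprocess.run(["scp", TARBALL, f"{SERVER}:/tmp/{TARBALL}"], check=True)
    subprocess.run(["ssh", SERVER, "docker", "load", "-i", f"/tmp/{TARBALL}"], check=True)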
Maybe this is the norm in big enterprises, but I have not actually come across any company which hosts a local package repository for commonly available packages.
In big enterprises, core production infrastructure frequently has NO access to internet, ever.
Most places I've ever worked in will have local repositories, procedures and timelines for anything from Microsoft and OS updates, to development stacks and libraries.
Neither workstations nor servers get updated directly from external/vendor/open repositories - it is all managed in-house.
Slower, yes; more work, yes; but that's exactly the type of issue it's meant to prevent :)
Really? We certainly do here. Linux packages are mirrored. Programming dependencies are mirrored via tools like Nexus or Artifactory.
Like tsuresh said - stuff happens. What if your internet connection went down for a long period of time? You couldn't continue working. It takes very little to set up, gives you failover, and also makes installing dependencies sooo much faster.
You don't need a seamless, robust process for dealing with the occasional remote failure (especially when there are mirrors), but you can for example save snapshots of dependencies.
You should be able to do something in an emergency, even if it requires manual intervention. If you can only shrug and wait, that's bad.
At work, we've got our prereqs stored in a combination of Artifactory and Perforce. Even for my own personal projects, I'll fetch packages from the Internet when I'm setting up my dev environment, but actually doing a build only uses packages that I've got locally.
It's a little mind-boggling to me that anyone would rely on the constant availability of a free external service that's integral to putting their product together. I handle timestamps for codesigning through free public sites, but I've also got a list of 8 different timestamp servers to use, and switch between them if there's a failure.
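The failover pattern is simple enough to sketch: keep a list of equivalent servers and walk down it until one answers. The URLs below are placeholders; a real codesigning setup would hand the chosen server to the signing tool rather than just probing it.

    import urllib.error
    import urllib.request

    # Hypothetical list; substitute the timestamp services you actually use.
    TIMESTAMP_SERVERS = [
        "http://timestamp.example-one.com",
        "http://timestamp.example-two.com",
        "http://timestamp.example-three.com",
    ]

    def first_reachable(urls, timeout=10):
        """Return the first server that answers at all; raise if none do."""
        for url in urls:
            try:
                urllib.request.urlopen(url, timeout=timeout)
                return url
            except urllib.error.HTTPError:
                return url   # the server answered, even if with an error status
            except OSError:
                continue     # connection-level failure: try the next one
        raise RuntimeError("no timestamp server reachable; fail the build loudly")

    print("using", first_reachable(TIMESTAMP_SERVERS))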
You don't see well-written, solidly thought-out code being the norm either, for pretty much the same reason. It takes experience and guided thought to get to this point, and seasoned sysadmins (who have this worked out) aren't exactly the crowd considered to be sexy nowadays.
Infrastructure that works has very little drama. Without drama, you're not in anyone's sphere of attention, and being outside that, there's no reason to be sexy.
I agree. I'm a sysadmin myself, and the best congratulations I've ever had was after a heavy rewrite of a firewall script (the old one was a mix of three different styles, an unmanageable mess after several years of work), when my colleague asked me when I was going to deploy the new firewall - two or three weeks after it had actually gone into production. It was so smooth that nobody noticed.
Everywhere I've worked, mostly very small companies for the past decade, has kept local mirrors of critical infrastructure we depend on, specifically distro repositories, CPAN, etc. It's just a sane best practice. It really doesn't take all that much work to run apt-mirror or reprepro.
Devops here for a startup: We run our own repo for critical packages specifically to ensure AMI baking and deploys do not break when a package isn't available for whatever reason (and we pin to the package version).
After reading the bug report I am surprised that so many people are using remote package repositories for their build machines' builds... then again, I'm not too surprised I guess.
I'm not that familiar with Docker, but I am with package/dependency management (debs, jars, npm, eggs, etc.), and you most certainly want to use a mirrored package repository (JFrog, Sonatype, or whatever) for this reason and many others (bandwidth, security, control, etc.).
So if you did have issues with the outage I would look into one of those mirroring tools. At the minimum it will speed up your builds.
I wonder how this all works if you have really small, non-critical systems and you are using SaaS and PaaS infrastructure (hello cloud computing, not doing stuff in house) like Travis CI, where you are not in control of their repositories. These kinds of services (like SaaS CI) are not new, but they make life a lot easier than running your own CI (looking at you, Jenkins). Not everybody wants to replicate whole repos for everything. Also, some people (like me) want to distribute open source software systems (which will not install themselves on the client side if one component on the Internet fails), not mission-critical single-point services.
There's truth here, except that running your own repos to get immutability shouldn't be something you have to do for infrastructure. Aptitude, and package managers of its ilk, need to die.
You don't have to run your own repos for immutability. You just have to use repositories that say up front what will - or will not - change, and then don't do anything on top of them that expects something different from what that particular repo provides.
The question is why anyone would expect immutability after pointing their package tools at a mutable repository.
Probably broken, since a competent attacker would have been able to avoid creating a checksum mismatch.
My company has actually done the same thing before (same error) by putting CloudFront in front of our APT repo -- it cached the main Packages file inappropriately, causing the checksum mismatch.
Yeah... the entire point of an on-call rotation is to specify who is available for incident response... if everybody is on the same plane at once then by definition nobody is on-call.
There was a comment a bit below suggesting that those who paid for commercial support got it 24/7, but if that's true, I'd imagine the fix that commercial support would have given to paying customers would have fixed it for everyone else too...
I believe commercial releases are downloaded from a separate infrastructure (to be confirmed).
Either way, the availability of Docker packages, free or commercial, is critical infrastructure and we should treat it as such. IMO our primary infrastructure team should have been involved, and someone should be on call for this. We'll do a post-mortem, find the root cause, and take corrective action as needed.
Part of a healthy community is accepting that people disagree a lot, have different values, and communicate their ideas in very different ways.
Where we draw the line is if people are being intimidated, bullied, insulted, or anything that even remotely resembles harassment.
Although I personally feel that some of the comments in that thread are pretty unfair and poorly informed, they don't seem to violate the social contract.
The commercial releases are different from the open source ones (they have different patches and are built as separate releases), and they follow a different release schedule.
Captain Blackadder: Baldrick, what are you doing out there?
Private Baldrick: I'm carving something on a bullet, sir.
Captain Blackadder: What are you carving?
Private Baldrick: I'm carving "Baldrick", sir.
Captain Blackadder: Why?
Private Baldrick: It's part of a cunning plan, sir.
Captain Blackadder: Of course it is.
Private Baldrick: You know how they say that somewhere there's a bullet with your name on it?
Captain Blackadder: Yes?
Private Baldrick: Well I thought that if I owned the bullet with my name on it, I'll never get hit by it. Cause I'll never shoot myself...
Captain Blackadder: Oh, shame!
Private Baldrick: And the chances of there being two bullets with my name on it are very small indeed.
Captain Blackadder: Yes, it's not the only thing that is "very small indeed". Your brain, for example. Your brain's so minute, Baldrick, that if a hungry cannibal cracked your head open, there wouldn't be enough to cover a small water biscuit.
Assume N is the total number of buses. Where N=1 your solution works. For N>1 the engineers can still be hit by a bus. This also ignores all other vehicles such as garbage trucks, semis and stealth bombers, all of which can take out your bus.
I don't think this would surprise anyone that has used Docker Hub in their CD pipeline. So many reliability issues.
We moved off it last year. When we went to cancel our subscription the other month, downgrades/cancellations were broken on the site as a known issue; we had to open a support ticket. Most of the UI issues were still present, along with some new ones.
It's scary how most people in that thread seem more concerned with forcing the installation through than with pausing to consider why the hashes might be wrong, and why it might not be a good idea to install debs with incorrect hashes.
If the apt repo was compromised (but the signing keys were not), this is very likely exactly the symptom that would appear.
> If the apt repo was compromised (but the signing keys were not), this is very likely exactly the symptom that would appear.
I don't think that's correct. It would pass a checksum test and fail a signature test with a "W: GPG Error". The checksum test is not about cryptographic security, it's just about files referenced by the Packages file having the same hash that the Packages file declares them to have. You don't need any signing keys to make that happen.
What's more suspicious: Bad hashes or bad signatures? What would an attacker choose if their goal was to get as many people as possible to force install?
It's impossible to force install the packages when they have bad hashes (hence the severe breakage here), and it is possible to install the packages when they have bad signatures if you didn't import the gpg key or don't run with signature checking.
So I'd guess a rational attacker would choose a bad signature. But attackers can be irrational; it doesn't prove it's not an attack. Just not my intuition.
This is a really bad title. There is nothing wrong with either Ubuntu's or Debian's repositories. The problem is with Docker's repositories of Ubuntu/Debian packages.
I'm a bit disappointed that people are willing to make public criticisms of Docker when it's their builds that are failing. They made the decision to depend on a resource that could be unavailable for a large number of reasons entirely unrelated to Docker or their infrastructure.
Just like the Node builds that failed, this should cause you to rethink how you mirror or cache remote resources, not prompt you to complain about your broken builds on a GitHub issue page. There may be things you'll never be able to fully mirror or cache (or it could just be entirely impractical), but an apt repository is definitely not one of them.
... which is why the clever sysop mirrors his packages and tests if an update goes OK before updating the mirror.
If you're running more than three machines or regularly (re)deploy VMs, it is a sign of civilization to use your own mirror instead of putting your load on (often) donated resources.
It's the same stupid attitude of "hey, let's outsource dependency hosting" that led to the leftpad NPM disaster and will lead to countless more such disasters in the future.
People, mirror your dependencies locally, archive their old versions, and always test what happens if the outside Internet breaks down. If your software fails to build when the NOC's uplink goes down, you've screwed up.
I often wonder why the community's response to issues with an open/free/community package is to give the maintainers a strong argument to discontinue it in favour of a commercial one, or just abandon it altogether.
I think this is a chain of dependencies, especially when you use Travis CI: 1) apt-get isn't flexible enough to ignore that error on apt-get update; 2) Travis CI has so much external stuff installed that it's a big, big image with more failure points; 3) the Docker repo failed.
Outages or mis-configurations can happen to pretty much any source of packages you use, be it debian, pypi, npm, bower or maven repositories, or source control. Anybody remember left-pad?
So as soon as you depend heavily on external sources, you should start to think about maintaining your own mirror. Software like Pulp and Nexus is pretty versatile and gives you a good amount of control over your upstream sources.
I am copying it below:
<<< Hi everyone. I work at Docker.
First, my apologies for the outage. I consider our package infrastructure to be critical infrastructure, for both the free and commercial versions of Docker. It's true that we offer better support for the commercial version (it's one of its features), but that should not apply to fundamental things like being able to download your packages.
The team is working on the issue and will continue to give updates here. We are taking this seriously.
Some of you pointed out that the response time and use of communication channels seem inadequate; for example, the @dockerstatus bot did not mention the issue when it was detected. I share that opinion but I don't know the full story yet; the post-mortem will tell us for sure what went wrong. At the moment the team is focusing on fixing the issue and I don't want to distract them from that.
Once the post-mortem identifies what went wrong, we will take appropriate corrective action. I suspect part of it will be better coordination between core engineers and infrastructure engineers (2 distinct groups within Docker).
Thanks and sorry again for the inconvenience. >>>