Hacker News
Automating Datacenter Operations at Dropbox (dropbox.com)
94 points by dmicher on Jan 16, 2019 | 22 comments



A theory based on the article is that Pirlo may be written in Python (going off the fact that it leverages SQLAlchemy), which is interesting given most providers are writing new infrastructure code largely in Go (Spinnaker is another Python exception).

I'm guessing that Python is still heavily used inside Dropbox, but does anyone know if they have published any style guides or tooling for managing Python codebases at their scale?


Dropbox hired the creator of Python to help migrate their huge Python 2 codebase to Python 3. It would make sense for them to continue with their investment in Python.


Dropbox sponsors most of the developers of http://mypy-lang.org/ , an optional static typing system for Python.

I've found it to be hugely useful in safely working in a large Python codebase :)
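To give a flavor (a generic sketch, not Dropbox code), a couple of annotations are enough for mypy to catch a bad call that plain Python would silently accept:

    from typing import Dict, Optional


    def find_user_id(username: str, users: Dict[str, int]) -> Optional[int]:
        """Return the user's id, or None if the username is unknown."""
        return users.get(username)


    # At runtime this just returns None; mypy flags it before it ever runs,
    # because argument 1 is an int where a str is expected.
    find_user_id(42, {"alice": 1})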

(NB: I work at Dropbox)


Curious if you’ve ever tried Cython out. Not quite the same, but a semi-similar end goal to a degree. I started looking into it the other day and it might provide some nice improvements to heavier applications.
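For anyone following along, Cython's "pure Python" mode looks roughly like this (a toy sketch of my own, nothing to do with Dropbox's code): the file runs as ordinary Python thanks to the cython shadow module, but compiling it turns the typed loop into C.

    import cython


    def harmonic_sum(n: cython.int) -> cython.double:
        """Sum 1/i for i in 1..n; the typed locals are what Cython speeds up."""
        total: cython.double = 0.0
        i: cython.int
        for i in range(1, n + 1):
            total += 1.0 / i
        return total


    if __name__ == "__main__":
        print(harmonic_sum(1_000_000))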


> A theory based on the article is that Pirlo may be written in Python (going off the fact that it leverages SQLAlchemy)

Sounds like a reasonable guess.

> which is interesting given most providers are writing new infrastructure code largely in Go

Dropbox apparently uses Go a lot, see e.g. https://about.sourcegraph.com/go/go-reliability-and-durabili...

They also use Rust for some performance-critical stuff, e.g. https://news.ycombinator.com/item?id=11282948

That being said, AFAIK Dropbox was originally written mostly in Python. Probably there's still lots of that left.


This is also borne out by the fact that they mention Celery.


Unless I'm mistaken, Guido still works for Dropbox. And as far as I know they still mostly do Python.


> which is interesting given most providers are writing new infrastructure code largely in Go

I'm not sure this is true. There's a lot of Java around, Scala is pretty popular too, and in many industries C/C++ is still the norm for this sort of code.

Go might be the trend in open source infrastructure projects right now, but a significant amount of that is likely to be inertia from Docker and Kubernetes.


Spinnaker is written in Java


"While there are some excellent job queue systems such as Celery, we didn’t need the whole feature set, nor the complexity of a third-party tool. Leveraging in-house primitives gave us more flexibility in the design and allows us to both develop and operate the Pirlo service with a very small group of SREs."

I don't know any more than this paragraph or two explains, but it sounds like NIH syndrome. If you chose to write your own solution just because a 3rd party one was complicated or expensive, you've underestimated the complexity and expense of developing and supporting new software. Not only do you have software developers developing your business products, but now you have software developers developing the IT tools that support the software developers writing the business products.

"Using the network database and configuration tool developed by our Network Reliability Engineering (NRE) team,"

Another custom tool? Network inventory and config management tools do exist already...

"Rather than having engineers manually running tests using playbooks, Pirlo performed an automated sequential battery of tests that reduced the need for hands-on attention and concurrently increased diagnostic accuracy."

Or you could, like, install Jenkins, write your tests, and do all this without writing your own distributed job queue system.
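For a sense of scale, the core of a DIY SQLAlchemy-backed job queue is on the order of the sketch below (purely hypothetical table and function names; the article doesn't show Pirlo's actual design) -- and that's before retries, scheduling, workers, and monitoring.

    import datetime

    from sqlalchemy import Column, DateTime, Integer, String, create_engine
    from sqlalchemy.ext.declarative import declarative_base
    from sqlalchemy.orm import sessionmaker

    Base = declarative_base()


    class Job(Base):
        __tablename__ = "jobs"

        id = Column(Integer, primary_key=True)
        task = Column(String, nullable=False)      # e.g. "validate_switch"
        state = Column(String, default="queued")   # queued -> running -> done/failed
        claimed_at = Column(DateTime, nullable=True)


    engine = create_engine("sqlite:///queue.db")
    Base.metadata.create_all(engine)
    Session = sessionmaker(bind=engine)


    def claim_next_job(session):
        """Claim the oldest queued job. A real multi-worker queue would use
        SELECT ... FOR UPDATE SKIP LOCKED (or equivalent) to avoid races."""
        job = (
            session.query(Job)
            .filter(Job.state == "queued")
            .order_by(Job.id)
            .first()
        )
        if job is not None:
            job.state = "running"
            job.claimed_at = datetime.datetime.utcnow()
            session.commit()
        return job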


Developers at these big companies get bored and assume they know better than a mature open source solution, so they like to reinvent a perfectly good wheel, basically.


Or they've actually used Jenkins (or any of these other suggested alternatives). I've used Jenkins personally and professionally.

Most recently I've been rebuilding my own CI stack and never really gave much thought to going back to Jenkins. So I've been asking around and one of the only common complaints I've heard so far is that getting the initial configuration done is painful and generally orthogonal to automation. Plus the documentation is atrocious.

No off-the-shelf product will be a perfect fit, but with CI software I was truly surprised at just how large the gaps were.

So, sure, if you're Dropbox and you want to automate everything, Jenkins is almost certainly not the right tool for the job. If Dropbox already had a supported, mature in-house job queue system, why not use it?

Conversely at megacorp, they spent 4+ years claiming to work on deploying a Jenkins (CloudBees) cluster and still came up with bupkis. Our own internal job and message queuing systems were astoundingly bad (and support for internal tools was almost entirely forbidden).

At megacorp I absolutely decried any sort of home grown solution. But if Dropbox were to actually tackle the problems of internal testing and support and come up with mature solutions, why shouldn't they use them?


Hi, I'm Kohsuke, the creator of Jenkins. I'm sorry to hear that you had a bad experience.

Would you be willing to let me interview you so that I can learn where it failed your expectations? I'm honestly trying to learn where we can do things better, and often what's obvious to one person is completely incomprehensible to another. So I think this is a great opportunity for me to learn a fresh perspective.

My contact information is in my personal profile.


Jenkins isn't a terrible experience, and I've used it personally and professionally (and would do so again where it's a good fit), but for my current project it missed a few of the requirements. In trying to rationalize the whole NIH thing, I talked to some friends and peers about their CI experiences. I got pretty consistent responses on Jenkins.

My relevant requirements:

1.) The software needs to be self-hostable and run on the BSDs. For the most part this narrows down the options to buildbot and the Java based CI options (Jenkins and GoCd). Travis could probably be run on FreeBSD, but the open source bits are essentially abandoned (e.g. some repos are missing) with no documentation. Nearly everything else these days is strongly tied to Linux via docker. Some free hosted services offer a FreeBSD target, but I'm looking to test on DF/Free/Net/OpenBSD.

2.) The software needs to scale down. The GoCd folks suggested that the agent would need around 500 MB of RAM. I haven't profiled Jenkins, but I can't imagine the agent being that much lighter weight. Certainly the Jenkins server process is glacially slow. By contrast my prototype in Rust is showing memory usage of under 5 MB for each process (agent + server). I expect that to grow a little bit, but not by an order of magnitude.

3.) The software needs to handle multi-arch builds. Travis does this extremely well. Buildbot and GoCd, kinda. Jenkins does not handle this use case (e.g. pipelines + matrix builds are not supported). I really like the way Travis basically handles these as sub jobs.

My experience:

A.) The Jenkins documentation is terrible, if it exists at all. I've heard that this has been improved in the year or so since I last looked at Jenkins (but that hasn't been my experience). I mentioned this to one of the CloudBees guys at the DevOps Days conf I went to last year and got an ack that this is a known issue (although CloudBees has driven a ton of Jenkins documentation and improvement). At MegaCorp we paid a fortune to CloudBees, which helped a ton but didn't really help end users. I cannot overstate just how much of a detriment the documentation is.

At the opposite end of the spectrum, Rust (except for the async stuff) and Postgres are just a dream come true. If it's any consolation, the GoCd documentation is pretty atrocious as well. Almost none of it is up to date with the current UI.

B.) The Jenkins community tends to cargo cult Jenkins-Groovy snippets like crazy, potentially as a result of #A. Having a good community helps documentation and helps when there are gaps in the documentation.

C.) Bootstrapping Jenkins is not something easily done in an automated way. The CLI is not stable and I had tons of trouble trying to get plugins and dependencies sorted without having to drop into the GUI. For homelab stuff I've automated bootstrapping of nearly everything except for Jenkins with Ansible.

I don't think these are new or unknown issues as in talking to friends and peers I've found that the typical responses regarding Jenkins are along the lines of: Jenkins works well enough so that we're not motivated to switch, but A & C are our main pain points.


Thanks for taking the time to put this together. Yes, much of it isn't new, but it's always good to hear how these dots are connected in other people's views to form a theme.

On C, I think we've made good progress with <https://github.com/jenkinsci/configuration-as-code-plugin>, which I think you'd like.


Have you looked at https://buildkite.com/? A great UX and workflow for orchestrating tests, but the tests still run entirely in/on your own infra.


I have not. The continuous integration infrastructure at megacorp was merely a symptom of some major cultural problems that over the course of a few years just weren't going to be addressed to my satisfaction, so I left. Paying for Cloudbees was a painful exercise, and I think the guy that was pushing for it ended up getting shitcanned. After all, the whole point of deploying Jenkins was to avoid paying for TeamCity.

For my personal stuff, I haven't looked at buildkite primarily because I'm looking for free, open source solutions. The secondary (but no less important) issue is that I'm also looking for coverage for *BSD. Buildkite lists FreeBSD support so, assuming it's not a Linux binary expected to run under the translation layer, they could presumably offer binaries for the other BSD variants. Right now I've got some buildbot workers setup on some bhyve VMs with df/net/openbsd and some jails with FreeBSD.

The big thing I've noticed is that lots of tools will list FreeBSD support at least, but few of those claims are actually tested. Even fewer mention any of the other BSDs. For example: Elixir deployments on FreeBSD have been broken for months; HashiCorp put out a FreeBSD build of their Vault SSH helper that didn't work by design -- but they had no idea because they never tried to validate the build; Rust's libc crate lists FreeBSD, NetBSD, and OpenBSD as tier 1 platforms, but they're only testing the crate on an old version of FreeBSD (where they've disabled some key tests due to a misunderstanding, which makes the tests not super useful)... plus they're not in a good position to deal with the C ABI changing between releases.


The irony of people on Hacker News saying essentially "Dropbox could have done this just by using open source tools" entertains me, given that this is the site where people famously announced that Dropbox would fail because it was easy to reproduce using open source tools.

Maybe, just maybe, the engineers in question considered mature open source tools and decided they didn't serve their needs.


> I don't know any more than this paragraph or two explains, but it sounds like NIH syndrome. If you chose to write your own solution just because a 3rd party one was complicated or expensive, you've underestimated the complexity and expense of developing and supporting new software. Not only do you have software developers developing your business products, but now you have software developers developing the IT tools that support the software developers writing the business products.

I think the part "Leveraging in-house primitives" is the important bit here. It sounds like they have an in-house task queue, so introducing another external one like Celery might add a lot of operational complexity compared to piggybacking on existing infrastructure. As for the reason the existing infrastructure isn't Celery, I suspect that at Dropbox's scale tools like Celery start to break down.
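For comparison, the Celery baseline being turned down boils down to something like this (a minimal, generic sketch with a made-up broker URL and task; not how Dropbox runs anything):

    from celery import Celery

    app = Celery("queue_demo", broker="redis://localhost:6379/0")


    class TransientNetworkError(Exception):
        """Stand-in for a transient failure a diagnostic step might hit."""


    def run_diagnostics(switch_hostname):
        """Placeholder for real diagnostic work."""
        return {"host": switch_hostname, "status": "ok"}


    @app.task(bind=True, max_retries=3)
    def validate_switch(self, switch_hostname):
        """Celery provides retries, routing, and result storage around this."""
        try:
            return run_diagnostics(switch_hostname)
        except TransientNetworkError as exc:
            raise self.retry(exc=exc, countdown=30)

Even that much means running a broker (and usually a result backend) in production, which is presumably part of the operational complexity they wanted to avoid.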

> Or you could, like, install Jenkins, write your tests, and do all this without writing your own distributed job queue system.

They likely haven't written their own job queue, they've just used an existing one. Again, the operational overhead of Jenkins is unfortunately very non-trivial.


> Again, the operational overhead of Jenkins is unfortunately very non-trivial.

Not to mention the overhead of running Jenkins at the kind of scale they're talking about. We're not talking about small numbers here.


I don't know other people's experience with Jenkins, but I didn't find it to be much overhead. It's got all the capabilities you need to run at high scale, assuming you use it properly, like using the DSL and containers, and managing the configuration and deployment as code.

Jenkins X, for example, is inherently cloud-based and dynamically scaling. If you need a generic job queue to grow with the organization, that's what it does. And you can even pay someone to set you up, reducing lead time.

But there's also a dozen similar systems out there, so you're not even limited to this one. There's lots of choices and really no need to make a new one.

Re-using an existing component to make a new tool is still an independent software development project, which will almost never be as cheap as integrating a finished tool.


>> "in-house primitives"

^ perfect band name



