Scaling lessons learned at Dropbox (2012) (eranki.tumblr.com)
127 points by vinnyglennon on April 8, 2022 | 29 comments



I was at Dropbox from 2015-2020, and it's funny how many things have changed and how many stay the same.

Changes:

- When I joined in 2015, it was already forbidden to just log text on the server; only a small number of SREs even had access to the text logs. Any event you cared about needed to be logged through the system described in "App-specific metrics", which was pretty easy to use. (That said, on desktop we still used logspam, because when you're debugging a desktop client doing the wrong thing, sometimes the only lead you have is tracing which lines of code are being run by looking at which logs are being printed.)

- Re: shards, by the time I joined, most of the MySQL usage had been replaced by a homegrown KV/graph database built on top of MySQL with a fixed number of shards (something like 255).

Things that didn't change (at least by the time I left in 2020):

- Everything was in UTC, including the UIs for everything (e.g. metrics, exceptions, crashes, etc.). I ended up making a GPS clock that showed UTC to put on my desk as a fun project, and I know at least one SRE who had set basically all their clocks to UTC (phone, laptop, calendar, etc.). This is in contrast to Google, where almost every system shows either your local time or "Google Standard Time" (which is just Pacific Time).

- Python was still used for virtually everything.


Fellow ex-Dropboxer here (2016-2021); the only part I disagree with is "Python was still used for virtually everything". Several core internal and user-facing services were in golang even by early 2020 (authentication, the equivalent of IAM, part of the API gateway, etc.).

Unfortunately, I don't think I'd describe dbx as a place with much simplicity in its infrastructure during the time I worked there, but it was good to have a standard set of rules enforced everywhere.


"One technique we repeatedly used was creating artificial extra load in the live site. For example, we would do a lot more memcached reads than necessary. Then when memcached broke, we could quickly switch off the duplicate queries and have time to come up with a solution."

As much as I want to hate this, it's actually genius and elegant in its simplicity.
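A minimal sketch of the trick, assuming a Python service with a hypothetical cache client and a runtime-tunable knob (the names are illustrative, not Dropbox's actual code):

    import random

    # Headroom knob, flipped via config at runtime: 1.0 means one
    # duplicate read per real read, 0.0 switches the extra load off.
    DUPLICATE_READ_FRACTION = 1.0

    def cached_get(cache, key):
        value = cache.get(key)
        # Artificial extra load: a duplicate read whose only purpose
        # is to burn headroom we can reclaim instantly in an incident.
        if random.random() < DUPLICATE_READ_FRACTION:
            cache.get(key)  # result deliberately discarded
        return value

When the cache cluster starts to struggle, dropping the knob to 0.0 roughly halves the read load without touching any real code path.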


I do something similar with disk usage: create a dummy file of XX GB and delete it when you are running out of space.
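A hedged sketch of that trick in Python (Unix-only, since it relies on os.posix_fallocate; the path and size are made up):

    import os

    BALLAST = "/var/tmp/ballast"   # illustrative path
    SIZE = 10 * 1024**3            # 10 GiB of reclaimable headroom

    def create_ballast():
        # posix_fallocate actually reserves blocks, unlike truncate(),
        # which would only create a sparse file.
        fd = os.open(BALLAST, os.O_CREAT | os.O_WRONLY, 0o600)
        try:
            os.posix_fallocate(fd, 0, SIZE)
        finally:
            os.close(fd)

    def release_ballast():
        # When the disk fills up, delete the ballast to buy time.
        os.remove(BALLAST)

The shell equivalent is a one-liner with fallocate -l.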


I worked at a shop where they didn't bother with a partitioning scheme - everything was just one large volume. The number of times I've had to do emergency cleanup on production servers...


IIRC ext4 does this already: it reserves about 5% of blocks for root by default (tunable with tune2fs -m), so root can still write after ordinary users have filled the disk. Saved me a couple of times.


Are load tests not a standard thing people do any more?


When unexpected load causes a production outage, it's often something you wouldn't have predicted using a load test.


Sure, but I just mean sending artificial load to see where you tip over.


Only for the cases you are set up to test. This is about keeping headroom on your golden resources, so you can buy yourself time under any kind of overload.


Reminds me a bit of this AWS Builders' Library post about doing constant work, since variance can lead to instability.

https://aws.amazon.com/builders-library/reliability-and-cons...


This may be from 2012, but it's surprising just how much of the specific advice still applies.

Grep, sed, awk, xargs, etc. are still the canonical set of shell tools for debugging, and have been for decades.

Makes you realise that learning Unix tools will likely be a better long-term investment than the latest trendy framework. I wish I could go back in time and tell that to my younger self...


Couldn't agree more!

When I am forced to use a Windows environment, I find myself reaching for these evergreen tools in many situations and am frustrated at having to find workarounds for tools that should be ubiquitous in any OS.

Everything I learned almost 30 years ago still holds true today when it comes to the basics.

And yes, I know of Cygwin, the MKS Toolkit, etc., but trying to convince corporate IT in a Microsoft-centric world to install these usually hits a brick wall.


Sometimes nothing beats grep, but in a Windows environment I usually just go to Python and write a quick script to handle this stuff.

Lots of times it is really nice to have all your "debugging scripts" in a repo anyway. Once you have that, it isn't hard to set up a basic Python environment and keep your scripts there.
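For instance, a tiny grep-like script of the sort you might keep in such a repo (entirely illustrative):

    import re
    import sys

    # Usage: python loggrep.py PATTERN FILE [FILE ...]
    # A small stand-in for grep on machines where it isn't available.
    pattern = re.compile(sys.argv[1])
    for path in sys.argv[2:]:
        with open(path, errors="replace") as f:
            for lineno, line in enumerate(f, 1):
                if pattern.search(line):
                    print(f"{path}:{lineno}:{line}", end="")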


When your logs are big enough, Python doesn't really compare in terms of raw speed, though it may win on the time it takes to write the query if you don't know how to use the aforementioned tools.


Or learn how to achieve the same with PowerShell and TechNet utilities.

Not every OS needs to be a UNIX clone.


Thankfully git is everywhere these days, and git on Windows comes with most of the coreutils. I hope IT isn't telling you you're not allowed to use source control; Microsoft owns GitHub, after all.


Microsoft Azure also offers blockchain solutions, but that doesn't mean every company using Microsoft's C# is pro-blockchain, just as companies aren't using Git merely because they are Microsoft shops.


Eh, forcing people to use Windows in big corps is such a PITA :(



What are some outstanding resources to learn from?


Honestly, I think the best way to learn these tools is to use them. After starting at BNR in 1990, I sat for a couple of days with Sobell's Practical UNIX on my lap, doing every example and every exercise; since then, I've stayed at the command line as much as possible. I've looked periodically for other, more recent resources, at least in part to answer questions such as yours, but I've never found anything that packed the punch of that early Sobell, not even his later editions.

If you want to direct the learning, though, I'd recommend figuring out how to do something with grep, say, then doing the same with sed, awk, cut, etc.

For example:

grep myuserid /etc/passwd

sed -n '/myuserid/p' /etc/passwd

awk '/myuserid/ { print }' /etc/passwd

etc.

Then make it more interesting: repeat those with 'grep -i MYUSERID /etc/passwd'.

Then make it even more interesting: print the 4th field from each matching line. (Hint: man cut.) Come up with at least two or three different ways of doing this with regexes, then make those regexes as compact as possible.

Then make it a lot more interesting: print the two lines before and the three lines after each pattern.

Once you do this for a while, you will get used to thinking in terms of filters and regexes, and things will start to come more naturally to you.

Also consider doing all of those exercises from within vim, using :g... commands (or any other alternative you can think of).


I liked the Missing Semester by MIT. It takes a first-principles approach to learning some of the foundational tools. https://www.youtube.com/playlist?list=PLyzOVJj3bHQuloKGG59rS....


https://www.oreilly.com/library/view/unix-power-tools/059600...

I used this a great deal when I first started out on serious UNIX work (1996+).


"Use UTC everywhere." Ten years later, it is still an important reminder. This should be the first thing new developers learn.
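In Python, that boils down to always using timezone-aware UTC datetimes:

    from datetime import datetime, timezone

    # Timezone-aware UTC: safe to store, log, and compare.
    now = datetime.now(timezone.utc)
    print(now.isoformat())  # e.g. 2022-04-08T17:03:21+00:00

    # Avoid naive local time like datetime.now(); its meaning
    # silently depends on the machine's timezone setting.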


In services, I find telemetry so much more useful and maintainable than logging. Logging tends to grow without bound, along with its noise, while telemetry encourages a more focused, intentional approach: track exactly the metrics that are repeatedly useful, and don't block on I/O unless you absolutely have to.
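As a sketch of the difference, assuming a statsd-style metrics client passed in from elsewhere (the client and metric names are illustrative):

    import logging

    log = logging.getLogger(__name__)

    def handle_upload(metrics, request):
        # Logging approach: unbounded free text, noisy, and you need
        # grep to aggregate it after the fact.
        log.info("upload started for user %s", request.user_id)

        # Telemetry approach: one intentional, pre-aggregated counter.
        # statsd-style clients send a fire-and-forget UDP packet,
        # so this never blocks on I/O.
        metrics.incr("upload.started")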


> For instance, almost every website has a thing where if you enter in a wrong username OR wrong password it’ll tell you that you got one wrong, but not tell you which one. This is good for security because you can’t use the information to figure out usernames, but it is a GIANT pain in the ass for people like me who can’t remember which username they registered under. So if you don’t actually care about exposing usernames (maybe on something like a forum or Pinterest where they’re public anyway), consider revealing the information to make it more convenient for users.

YES! Such a pain in the arse for a tiny, theoretical security benefit. In some situations it is warranted, e.g. dating sites. For something like Amazon or Wikipedia, who cares if you know that I've registered there?


I love hearing about services like this that managed to scale with not-so-cool tech.

Where are the stories about services that have scaled with the latest fad? Maybe they will be posted in a few years - or maybe those teams are too busy to blog ;)


Interesting tips.

Early attempts at what later became Chaos Monkey / Simian Army.



