Discord outage postmortem (statuspage.io)
157 points by fanf2 on March 23, 2020 | 60 comments



> Our service discovery system is resilient to various failure modes within etcd, however, we did not anticipate a client being able to write a corrupt record due to improper HTTP handling. golang's net/http module — which is in violation of the HTTP 1.1 specification in this specific case — introduced an additional failure case which we did not anticipate. It is expected that a Content-Length header and an incomplete body should be rejected by a server as being incomplete. Unfortunately net/http does not check that the bytes read from the body portion of the request match the Content-Length header.

Whichever engineer wrote this puts too much weight on the theoretical aspects of HTTP.

The reality is that any of Discord's software could have had erroneous code that announced an empty service list. This would have crashed all of Discord regardless -- transient network errors and mismatched HTTP Content-Length headers need not apply.

A resilient HTTP server will try to handle whatever the HTTP client sends it, and then the specific endpoint can enforce whatever levels of strictness are needed by the application. It's nice to be able to point a finger at "not my code", but how many popular web server libraries (not standalone web servers like nginx or apache) out there will actually flat out reject the entire request because of a simple Content-Length mismatch, rather than handing that decision off to the endpoint?

If an empty string is not a valid value for your service discovery system, you should reject it at the database with a CHECK CONSTRAINT if possible, or with the code that sits in front of the database if constraints don't exist (as I believe is the case for etcd). It's also not a bad idea for the clients to validate what they receive and filter out values they can't handle, to enhance robustness.
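
For the client-side half of that, here's a minimal sketch in Go (the helper names are mine, not anything from Discord's codebase): filter out entries you can't route to before they ever reach the rest of the system.

    // validServiceAddr reports whether a discovered address is usable.
    // Empty strings and unparseable host:port pairs are dropped rather
    // than being allowed to propagate into the routing layer.
    func validServiceAddr(addr string) bool {
        if addr == "" {
            return false
        }
        host, port, err := net.SplitHostPort(addr)
        return err == nil && host != "" && port != ""
    }

    func filterAddrs(addrs []string) []string {
        var valid []string
        for _, a := range addrs {
            if validServiceAddr(a) {
                valid = append(valid, a)
            }
        }
        return valid
    }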

You definitely shouldn't rely on nothing ever attempting to put a bad value into the database.

The web server's willingness to accept "strictly" malformed requests, which the engineers were unaware of, was certainly a contributing factor, but I don't think it was the root cause in this case, as they seem to be strongly suggesting.

(I think it would be neat to have a "strict" mode that you could turn on for every web server, especially when being used as an internally-facing service, but most companies want to make it as easy as possible for their customers/partners to integrate with their service... and that means accepting requests that aren't always academically perfect.)


> You definitely shouldn't rely on nothing ever attempting to put a bad value into the database.

It's a deeper problem than that, because etcd doesn't know what Discord considers a "right" value; etcd stores whatever you send to it, empty values included.

The question is why a single badly formatted JSON object caused an entire fleet of healthy nodes to fail. There is some JSON parsing somewhere that went really wrong.


I've only used Erlang in hobby projects, but it seems like a consequence of the "let it crash" philosophy being applied in an awkward situation. But I don't have the experience to know if this philosophy is something Erlang/Elixir programmers ignore in a circumstance like this, and if so how they determine when circumstances are such that "let it crash" should be ignored.


I feel like this is a misapplication of that philosophy.

Though my first exposure to (serious) exception handling was through the Common Lisp signal/restart system. So I view software as layers, and error handling as the communication of errors up and down those layers. Rust, for example, provides this through Result types and either composing or implementing error types.

My point being that a failure to parse a record from a service discovery system should crash just that parser process, not the whole system of processes. The manager of those parser processes should note (log) that the parser crashed on a record and then ignore it (probably as part of a broader refresh system, I would imagine). That would be the correct way to handle errors up and down, because the point of a process is to do a computation and communicate through reliable channels. Crashing can be a reliable way to communicate (it is communication of an unreliable state, but the point of letting it crash is that the resulting crash can be treated as a reliable signal), and the manager of the parsing process should have had a case for it that wasn't just crashing itself.


The sort of crash looping that took down their nodes is one of the parts of OTP that I’ve found to be both quite unintuitive and dangerous, for exactly the reasons described. There are controls for max retry attempts within a time window, but no obvious way, from what I can tell, to require a supervisor to halt rather than propagate a crash loop in a way that is easily auditable. This has caused me bad failures as well, and I’d be curious to hear what others are doing here. It’s not that it’s a hard problem to solve, but that it’s an easy thing to overlook.


Max restarts is a big sharp edge. I think anybody running Erlang in production must have hit it, probably a couple times. For things that I expect to fail regularly, but have no business bringing the system down, I reflexively sleep in the init, so it would only trigger max restarts if I really messed up. If it's consistently failing, it'll log enough to be visible, but it's not like this is an ideal way to handle it.
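
Outside of OTP you end up hand-rolling the same guard; a rough Go analogue of the sleep-before-restart idea (purely illustrative, nothing to do with Discord's actual code) would be something like:

    // runSupervised restarts work() whenever it returns, but never more
    // often than once per backoff interval, so a task that fails
    // persistently logs and retries instead of spinning in a tight crash loop.
    func runSupervised(ctx context.Context, backoff time.Duration, work func(context.Context) error) {
        for {
            if err := work(ctx); err != nil {
                log.Printf("worker failed, restarting in %v: %v", backoff, err)
            }
            select {
            case <-ctx.Done():
                return
            case <-time.After(backoff):
            }
        }
    }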


I think it would be nice if the HTTP server could at least optionally validate that the body length matches the declared Content-Length. For example, I would really not expect a malformed chunked encoding lacking a terminating zero-length chunk to silently succeed.
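
A sketch of what that opt-in check could look like today in a Go handler (this is not something net/http or etcd ships, just the manual version):

    func handler(w http.ResponseWriter, r *http.Request) {
        body, err := io.ReadAll(r.Body)
        if err != nil {
            http.Error(w, "error reading request body", http.StatusBadRequest)
            return
        }
        // Reject bodies shorter than the declared Content-Length instead of
        // passing a truncated payload on to the application.
        if r.ContentLength > 0 && int64(len(body)) != r.ContentLength {
            http.Error(w, "incomplete request body", http.StatusBadRequest)
            return
        }
        // ... hand the complete body to the application ...
    }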

However, I think you are dead on the money. Etcd is not validating the values but Discord’s software seems to assume the etcd values must be valid. That seems like a recipe for future outages if the mentality were kept this way.


Yeah, this is reliability engineering 101 - don't ever expect that outside content won't be malformed, even from 'trusted' systems. There was a codepath somewhere that went from an unmarshal error into a crash - and that's something that should not pass any code review.


In Erlang (and Elixir) systems, turning unexpected errors into crashes is what you're supposed to do. Unfortunately, it's pretty easy for that to spiral into node shutdown/restarts, everybody ends up learning that one the hard way.

Supervision trees really do need to be considered, but what are you expected to do when the service discovery system is unreachable, broken, or filled with bad data? There's certainly options, giving up isn't necessarily the wrong one.


> what are you expected to do when the service discovery system is unreachable, broken, or filled with bad data?

Different things based on which one of these actually happened!

I would assume the code that caused this issue was along the lines of:

    var addresses []string
    for k, el := range etcdEntries {
        var element data.Service
        // json.Unmarshal takes the raw bytes and a pointer to decode into,
        // and returns only an error.
        if err := json.Unmarshal(el, &element); err != nil {
            return fmt.Errorf("could not unmarshal service entry %q: %w", k, err)
        }
        // Addr is a hypothetical field on data.Service holding the address.
        addresses = append(addresses, element.Addr)
    }
And the issue is the lack of graceful degradation: one broken entry fails the entire conversion. Instead, this should be something like:

    var addresses []string
    for k, el := range etcdEntries {
        var element data.Service
        if err := json.Unmarshal(el, &element); err != nil {
            // Record and log the broken entry, then skip it instead of
            // failing the whole refresh.
            metrics.ServiceEntryBroken("key", k).Up()
            glog.Errorf("Could not unmarshal service entry %q: %v", k, err)
            continue
        }
        addresses = append(addresses, element.Addr)
    }
Of course, this takes much more effort. But such is the nature of writing high quality, reliable software.

Naturally, if your entire SD service was down, you would fail the entire thing much earlier. That is indeed the thing about Go errors being plain values - you are intended to write error handling logic (or at least are forced to consider it), not just pass the buck down by default, via either optional types or exceptions.


This may be true, but malformed input that is unchecked is not, imo, an “exceptional” case but nominally expected, and kind of dangerous since it is one of those things that isn’t likely to happen often since it requires another bug or fault to reveal itself.

It’s probably an unpopular opinion at this point but I definitely appreciate Go’s approach to error handling here. Error cases are in-your-face by default for the most part, leaving nearly only programming errors and critical faults to panic ideally.


The important idea of the Erlang approach is not to leave things unchecked; the idea is that the proper response to most errors is to restart from a known good configuration.

Of course, restarting the service discovery client may not help, especially when the service discovery system's state is stable but broken. I don't know what you're supposed to do if you're dependent on services, and the service discovery system you use to find the services is not reachable or is returning garbage information. Better logging is always nice, but you can't serve the traffic.
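
One of those options, sketched in Go with made-up names: keep the last known-good snapshot around and refuse to replace it with an empty or unparseable refresh.

    type serviceCache struct {
        mu    sync.Mutex
        addrs []string // last known-good address list
    }

    // update swaps in a fresh address list only if the refresh actually
    // produced usable addresses; otherwise the previous snapshot keeps
    // serving and the failure is logged for operators to look at.
    func (c *serviceCache) update(fresh []string, err error) {
        if err != nil || len(fresh) == 0 {
            log.Printf("service refresh failed (err=%v, %d addrs); keeping previous snapshot", err, len(fresh))
            return
        }
        c.mu.Lock()
        c.addrs = fresh
        c.mu.Unlock()
    }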


Well, let's try to imagine if BNS pooped out a corrupt update. Would every BNS client just silently pass on it, or would they all instantly suicide? I'm honestly not sure it's the former.


I don’t understand this hypothetical, perhaps because I don’t know what “BNS” is in this context. However, I think it’s reasonable to assume that if you have a key-value store and you haven’t modified it to validate your schema, then there’s no guarantee the schema will be valid across the database, and if your code doesn’t tolerate those conditions, eventually a bug will reveal that.



I did not know BNS was publicly known so apologies for playing stupid.

Of course BNS is at least designed for this specific use case. So this particular error, which imo is schema validation related, is not likely to occur.

I am sure some seasoned Google SREs have a good grasp on what would happen if BNS was spitting out junk, but my guess is it would at least cause outages of some kind if not crashloops.


Seems it's Borg Naming Service. I guess Borg is some Google-specific thing.


While I agree that Discord's code needed to be modified to handle bad data in etcd just in general... this is not really a theoretical aspect: checking the content length so you don't try to deserialise a partial body and send crap to your business logic is HTTP server writing 101.

Discord should've been more defensive.

The net/http devs should file this under "mistakes to admit to over beer to cheer up newbies who've just done something really dumb." (I have a number of such mistakes under my belt, we all make mistakes this stupid eventually ;)


In my opinion, the HTTP server’s Request.Body’s io.Reader should return an error on read when the request body ends unexpectedly, rather than having every use of the request body try to sanity check the length, which seems cumbersome and error-prone.
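
In other words, truncation should surface as a read error the caller can act on inside the handler, something like this (assuming the reader reported it as io.ErrUnexpectedEOF, which, per this thread, the server-side body reader apparently did not do at the time):

    body, err := io.ReadAll(r.Body)
    if errors.Is(err, io.ErrUnexpectedEOF) {
        // The client closed the connection before sending Content-Length bytes.
        http.Error(w, "truncated request body", http.StatusBadRequest)
        return
    }
    if err != nil {
        http.Error(w, "error reading request body", http.StatusInternalServerError)
        return
    }
    // body is complete here
    _ = body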


I believe it should, but it's likely a userspace proxy like HAProxy is in the way causing errors to be swallowed, as there's no way with the current sockets API to intentionally send an error over the wire.


Hi Jasper :)

Do you think so? I checked the Go http server code and my not-so-careful reading seems to confirm their suspicions: it just reads from the body io.Reader, which doesn't seem to validate length. (They DO use a limit reader to prevent DoSing with huge bodies, at least.)

Now if they were phoning out to a remote server, I’d assume some kind of fabric or load balancer. But I assume they’ve got a local etcd on each node or something along those lines so it is very possible that it’s all local networking.

Though in most cases, outside of this use case, I’d agree that this is not particularly useful because it would never be terribly reliable. I know for a fact Amazon ELB rewrites the entire request, as an example...


I was talking about getting an error when reading from io.Reader, like a TCP timeout. I was quite sure the error would bubble up, but it of course wouldn't if there was no error to begin with, with something like HAProxy in the way.


I haven't found the code yet but it should throw a line too long error for malformed chunked encoding.

The implementer can always compare the read bytes to the Content-Length themselves. I have a hard time believing, without evidence, that etcd's implementation is wrong at this stage.


As part of the fix, they mention changing the Python etcd client code to send the request in a single packet. How would one force that behavior? And is that even a good idea?

It seems like TCP fragmentation would be handled at a much lower level in the Python networking stack.


There are two ways to force this:

a) write the whole request in a single write call like MarkSweep said.

b) use tcp socket options to change behavior of multiple writes, TCP_NOPUSH on BSD, or TCP_CORK on Linux.

Option a is preferable in this case because it involves fewer system calls; although it's a once-a-minute job, so system calls on the client side don't really matter.
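
The client in question is Python, but here's roughly what option (a) looks like in Go against a raw connection (the addresses, path, and payload are all made up for illustration):

    // Build the entire request -- headers and body -- in one buffer and hand
    // it to the kernel in a single Write, so it isn't split across two write
    // calls that may land in separate TCP segments.
    body := `{"value":"10.0.0.5:4000"}`
    var buf bytes.Buffer
    fmt.Fprintf(&buf, "PUT /v2/keys/services/example HTTP/1.1\r\n")
    fmt.Fprintf(&buf, "Host: etcd.internal:2379\r\n")
    fmt.Fprintf(&buf, "Content-Length: %d\r\n\r\n", len(body))
    buf.WriteString(body)

    conn, err := net.Dial("tcp", "etcd.internal:2379")
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
    if _, err := conn.Write(buf.Bytes()); err != nil { // one write call
        log.Fatal(err)
    }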

Is it a good idea? In this case, certainly --- if the whole request is small enough to fit in the network MTU, it avoids a protocol error on the server side.

In the more general case of an HTTP request where the content is known and in memory at the time of header generation, yes, I would say it's more efficient to send the content with a single context switch (buffer space permitting) and fewer network packets. With one caveat to be aware of: if the larger network packets trigger MTU problems, the experience will be negative.

If the content size is known, but the content isn't loaded into memory --- such as when you're uploading a file, so fstat gave you the size but you haven't read it yet --- sending off just the headers with the Content-Length is probably better: if the content takes time to load, you'll get the request to the server to validate it sooner than if you waited for a full packet, and that improves processing time in case the server rejects the request, or has to do something time consuming before reading the content.

If the content size is unknown, so presumably the content hasn't been chosen or generated yet, sending headers soon is usually better for a similar reason.

Clearly, the etcd server should be changed to properly process HTTP requests that arrive over multiple TCP segments, but it was likely more expedient to "fix" the client as part of the rapid response. But, this client change could still be useful, as it should reduce packet processing load on the etcd server, and the various network equipment. Probably not a big impact, unless there's a ton of clients, but similar changes where the usage is higher can make a big difference.


I don't think it was related to TCP fragmentation. It must have been a different network issue, timeout, crash, or other bug that caused the connection to close.

They were relying on the server side to verify the HTTP Content-Length header to catch partial writes, which never happened, and they consider this to be a bug in the golang http handler.

Quoted from the writeup: """ We determined that the connection was reset after sending the first packet, but before the second packet could be sent. """

My bet is when they say "packet" here they really mean write/send. They describe the old version as doing one write for the http headers and one for the body. The new version makes just one write.


I’m guessing they switched from making two system calls to send data on the socket to a single system call.


That doesn't guarantee the write data won't become fragmented on the wire.


This is my guess - two write calls to one write call. I think this will happen to work even in the presence of Nagle’s algorithm, because there is initially not going to be an already sent packet waiting acknowledgement, so at least the first write call will probably be unbuffered in that sense.

Assuming an MTU of 1500 and ~40 bytes of TCP/IP headers (leaving ~1460 bytes of payload), plus a little bit of HTTP/1.1 headers, that probably leaves plenty of room for the etcd PUT to not get fragmented, I think.


That's a lot of assumptions and 'probably' for what supposedly fixes a critical bug.


There are two fixes listed. One is to improve client behavior to reduce the likelihood of an incomplete request being received. The other is to fix the crash bug that occurred when an incomplete request is handled. The fix would be incomplete with only the first mitigation.


No disagreement here - it could certainly break in the future, and is certainly not a solution anyone should rely on. However my reading of this is that it would have worked the same way like 20 years ago, so it’s probably not imminently at risk of working differently tomorrow.


I'm not so sure. It's one `mtu 150` fatfinger away from potentially breaking again.


I think an MTU of 150 would be noticed much more quickly than this fairly rare bug :) and that probably goes for any MTU short enough to cause a problem here. At least for busy servers, I’d expect to be very puzzled by a sudden dramatic uptick in network and probably CPU usage.

Edit: and it would probably break DNS as well.


No idea. This 'fix' doesn't make any sense.


It makes perfect sense as "makes it 95% less likely to hit this bug."

Notice from the write up that they already also modified their code to be resilient against the bug being hit at all, so "reducing the number of times the resilience code needs to be invoked by 95%" on top of that is a net win.


This fix is super common. There is a ton of software out there that freaks out when you don't send a message in a single call, in the hope that it results in a single packet. People don't understand that TCP alone doesn't really handle application-level messages.


I'm skeptical and would like to see some etcd/golang github issues referenced on this.


We used Discord for the "We vs Virus" hackathon last weekend and I was baffled by how much better than Slack it is.

Everything runs much smoother, and voice chat is like TeamSpeak back in the day. Solid product.


The level of detail in this is pretty nice. But I also feel like it's too much - some of the comments here are poking holes in the write-up. Is there a mis-translation from geek-speak to lay-speak (maybe due to the short turnaround)? Is it better to not mention some of these details at all?


It sounds like they were maybe pattern matching too strictly. If the value is always a string, just save the empty string and try to discover that service. If it doesn't exist, oh well, it doesn't crash the process.


> unexpected deserialization error

I've had this with Boost.Serialization. Not fun. No exception, just a crash.



"All times are PDT."

A daily reminder that one can do anything and everything in JavaScript, except display information in UTC.


The reason (it seems) they stick to one timezone is that comments/updates can be added to status page incidents. It can be confusing to talk about event times in a comment when the post shows different times. Here's an example from Slack, where it's mixed, comments in PDT and posts in the local timezone: https://status.slack.com/2019-06-28. I think it's a setting in Statuspage, but it's tough to decide which is best. I definitely wouldn't prefer looking at it in UTC.


This is just incorrect. I mean, Date.toISOString() will do that. Plus, it's very likely that this is just what's written in the post by the author and may have nothing to do with JavaScript. Or it could be JS that converts times to local time (I'm in PDT, so I can't tell), which is what a lot of sites do.


During the handling of an incident, the last thing you want to think twice about is timezones, so it is understandable for Discord who started in PT to just use PT universally. (Eventually, or perhaps already, it will just become a weird quirk of history.)


FYI for anyone using Discord - don't use the web client. Often users on the web client can only be heard by certain people, and vice versa.


It's a good thing that everyone makes their own discord "server", which naturally is resilient to any failures upstream by virtue of being its own fully contained, self-hosted software stack.

Wait, no, Discord just aped that word to gaslight users into behaving in a way which makes them more money.


The use of "server" to mean "instance" is very old. For efficiency's sake, multiple separate instances of old FPS games can be run in a single process, but I doubt you would object to the vernacular of "Quake server" as short-hand for "server-side Quake game instance".

Also, Discord internally calls them guilds.


That does not seem like a very generous interpretation.

I'd wager that most of Discord's clientele don't have any specific expectations for the word "server". Regardless, I think "server" was carried over from the Mumble / Ventrilo / Teamspeak gaming community.


That's where it came from. Before discord, gamers set up actual servers with Teamspeak or Mumble and handed the IPs and ports to their guildmates. The "Teamspeak server" was where you met for voice chatting, and since Discord just wanted to supplant Teamspeak with their own product, they kept the "server" part in the name of the thing that gamers were supposed to meet for voice chatting.

It is a pure marketing ploy. Internally they call the "server" thing a "guild" AFAIK, which is technically just as much BS as the "server" moniker, but for other reasons (not every community using Discord is a "guild" in the commonly accepted meaning of that word in the gaming world, which is a long-term organizational group doing stuff together in an MMORPG).


I don't put this on Discord - the word "server" has a substantially different meaning to varying constituencies and Discord is certainly not the first to use it in a way different from CS101.


> gaslight

People on the internet really love abusing this word, especially Twitter.

Lately, I've seen "he's gaslighting the world!" used in place of a public lie.


I think it is to the point where the word has lost its original meaning.


Discord servers are free. I fail to see how this was a move to make more money.


Ah, yes, I forgot that Discord runs their services out of the goodness in their hearts. That explains the $20B valuation.


The valuation comes from data. Text and voice. The latter to feed some AI.


I thought I was alone in being above-average annoyed at their (IMO deceitful) use of the term. Thank you for posting this.


They had to take the "server" term from TeamSpeak and other similar software. That's what gamers call the communication spaces.

Internally they name them "guilds" or something.



