Discord Postmortem from Friday

jhgg · on Oct 16, 2017

It's worth noting that the instance migration basically null-routed the redis VM for a good 30 minutes, until we manually intervened and restarted it. The instance was completely disconnected from the internal network immediately following the migration. From what we could gather from instance logs, the routing table on the VM was completely dropped and it could not even connect to the magic metadata service (metadata.internal - we saw "no route to host" errors for that). This is a pretty serious bug within GCP and we've already opened a case with them hoping they can get a fix. I think this is the 4th or 5th major bug we've encountered with their live migration system that could have led, or has led, to an outage or internal service degradation. The GCP team has seriously investigated and fixed every bug we've reported to them so far, so props to them for that! Live migration is incredibly difficult to get right.

We believe this triggered a bug in the redis-py Python driver we use (specifically this one: https://github.com/andymccurdy/redis-py/pull/886) that made us have to rolling restart our API cluster in the first place, to get the connection pools back into a working state. redis-sentinel had appropriately detected the instance going away, and initiated a fail-over almost immediately following the instance going offline, but due to the odd network situation that was caused by the migration (absolute packet loss instead of connections being reset), the client driver was unable to properly fail over to the new master. We already have work planned for our own connection pooling logic for redis-py, as right now the state of the driver in HA redis is actually pretty awful, and the maintainer doesn't appear to have the time to close or look at PRs that address these issues (we opened one that fixes a pretty serious bug during fail-over in March, https://github.com/andymccurdy/redis-py/pull/847, that has yet to be addressed).
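
To make the "absolute packet loss instead of connections being reset" distinction concrete, here is a minimal sketch using only Python's standard library. It is not redis-py code and does not reproduce the linked bug; the helper names `read_with_deadline` and `demo` are hypothetical, and a silent local socket pair stands in for the black-holed VM. The point is only that a peer which stops answering produces no error on its own: the read fails only if the client set a deadline, whereas a closed or reset connection is reported right away.

```python
import socket


def read_with_deadline(sock, timeout_s):
    """Try to read one reply and classify what happened.

    Returns "data", "closed", "reset", or "timeout".
    """
    # Without a timeout, recv() on a silent peer would block indefinitely.
    sock.settimeout(timeout_s)
    try:
        data = sock.recv(1024)
    except socket.timeout:
        # No bytes and no error arrived before the deadline:
        # this is what total packet loss looks like to a client.
        return "timeout"
    except ConnectionResetError:
        # The peer (or the network) actively reset the connection.
        return "reset"
    # An empty read means the peer closed the connection cleanly.
    return "closed" if data == b"" else "data"


def demo():
    # A connected pair of sockets; the second end never writes anything,
    # standing in (as an assumption) for a host that silently drops traffic.
    client, silent_peer = socket.socketpair()
    try:
        return read_with_deadline(client, 0.2)
    finally:
        client.close()
        silent_peer.close()


if __name__ == "__main__":
    print(demo())
```

A pooled client that relies only on resets or clean closes to notice a dead master would never see either signal in the silent case, which is consistent with the behavior described above.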

fulafel · on Oct 17, 2017

For those of us unfamiliar with GCP, do you mean that the default-route of your VM was unable to route its traffic? Or is there a routing config running on customer VMs that GCP live-manages?

b1naryth1ef · on Oct 17, 2017

GCP has a virtual networking stack to support a bunch of crazy (and awesome) features Google has built. Unfortunately the complexity here seems to hurt power-users like us. In this case it appears that for some unknown reason the node failed to program its network stack when coming up, meaning it was completely unavailable (even the metadata service used internally by Google failed).

cordite · on Oct 17, 2017

The level of detail and linearity is impressive.

At this scale, it seems like it may be warranted to start using reliability testing in production, like Netflix does.

At the end I see mention of a library with flaws. I am curious as to which library that is, given I develop some projects in Elixir.

b1naryth1ef · on Oct 17, 2017

Thanks, we try our best with these. Past experience has shown they can be very valuable, and help everyone at the company get context on the system and how we handle failures.

Reliability testing is definitely something we're interested in as we spin up more SRE/reliability focused individuals, but it also probably has the lowest cost-benefit for us (compared to engineering effort on improving the things we know need work). Some of the failures in the system we experienced are related to issues we know about, but haven't prioritized (read: had time for) yet.

For the library, we believe the bug is related to hackney and the fact it uses the high priority setting for its pool process. For some reason (this is the part we're not entirely sure on, and still spending some time investigating) this high priority process got stuck and consumed all of the scheduler time (presumably related to the earlier API degradation), breaking the distribution port and the application in a weird way. Oddly enough the systems we run on are SMP, so in theory one rogue process should not be able to have this effect.

cordite · on Oct 17, 2017

That is indeed very odd! Thank you for sharing. Hackney, through another library, is used in a telegram api wrapper that I wrote up. Though my stuff usually runs on a $5 vps, nothing with multiple cores.

phreack · on Oct 16, 2017

Ever since they launched screen sharing, I've uninstalled both Skype and Hangouts and relied entirely on it for pair programming sessions. The smoothness of the reproduction is just incredible, and I don't see myself going back soon.

ZeroCool2u · on Oct 16, 2017

I'm really impressed. I was using Discord for most of this weekend, specifically Friday and Saturday. Never noticed any issues.

aefx · on Oct 17, 2017

I was using it on Friday. My friend couldn't connect and I had trouble jumping into voice chat. I closed the client and was able to log in 10 minutes later. Overall I only noticed an issue for about 15-20 minutes. Having just read the post mortem I'm pretty impressed with their service and operations.

gizmo385 · on Oct 16, 2017

They don't have a posted outage for Saturday, but I noticed issues with it on Saturday evening/night I believe. I'm wondering if it was related to the issues that they included in the post-mortem.

b1naryth1ef · on Oct 16, 2017

Very possible you saw a slight interruption around 11:30PST for around 10 minutes until we found and decommissioned the host that experienced this problem. We generally don't update status until we can verify impact/source, we see tons of limited outages from ISPs misbehaving.

s_kilk · on Oct 16, 2017

I was recording a podcast through Discord, and got hit by this particular outage. To be fair, it’s the first I’ve seen first hand so

humanfromearth · on Oct 17, 2017

We had the exact same issue with RMQ (HA setup) on GCP (running on GKE) a few weeks ago. Tried contacting support about this, it's paid - no customer support for their own bugs.

The solution we came up with so far is to disable automatic migrations. Not sure if that option actually does anything.

sleepydog · on Oct 17, 2017

You can't disable automatic migrations in GCP. You can choose between allowing live migrations (move the instance while it's still running) or (hard) instance reboots.

humanfromearth · on Oct 17, 2017

You're right. I meant hard reboots.

atomical · on Oct 16, 2017

Does anyone use discord for work?

katastic · on Oct 16, 2017

I have informally with co-workers. But not in any official capacity.

It's like 20x better than every other product out there though. And their new video chat + screen sharing is pretty great. The bandwidth is far higher than any other competitors I've used.

My brother and I were playing 1080p videos on each of our screens and watching the other's, just to test it out. Obviously it wasn't full quality, but it kept the frame rate up and looked presentable at least to 720p.

avree · on Oct 16, 2017

Weird, I find Discord's audio quality especially to be terrible.

And their lack of scalable monetization leaves me worried about its long-term success as a platform - they are adding more cost-intensive features and continuing to try to support it with what is essentially a $5 monthly donation model.

b1naryth1ef · on Oct 16, 2017

Can you give more explicit examples of the bad audio quality you experience? I'd be happy to forward this onto our native team to look into if there are some concrete things they can look at. Generally 99% of the audio issues we see people experience are due to ISP/peering/DDoS/etc issues, most of which are handled automatically by our servers within a few minutes.

mooman219 · on Oct 16, 2017

Anecdotal, but direct calls are pretty unusable for me. I'm on the west coast, and when I attempt calls to the east coast, it frequently cuts out. The workaround was just creating a server and using a voice room in it. This had drastically higher call quality. I assume this is ISP/peering related, but to see such a night and day difference between voice channels and direct calls leads me to believe that there's something that can be done on your end.

Complaints aside, I love the service. Echoing what avree said, long term monetization worries me. I would like to see discord survive, but its story looks similar to Trello right now in terms of monetization.

b1naryth1ef · on Oct 16, 2017

Ahhhh, that's actually something we're aware of. Currently direct calls run on an entirely separate set of metals vs. everything else (this was mostly to help us test/measure video & screenshare rollout). Unfortunately some providers seem to be having issues with DDoS filtering over-triggering when they see video traffic, which negatively impacts the whole server. Something we're hopefully fixing in the short term!

asddddd · on Oct 16, 2017

I've been using voice calls regularly, and it's occasionally been problematic. Yesterday it was behaving as though it had absolutely insane packet loss (occasional robotic sounding fragments would make it through, otherwise the line was dead) without any indication of high jitter or packet loss - I had to resort to stupid workarounds to get it working. Move convo to a server -> change server to Central from West (why can't this be done for regular calls!) -> instantly working perfectly.

FWIW, it generally works pretty well, and overall, Discord is a fantastic product that I'm happy to be using.

eropple · on Oct 17, 2017

FWIW, a serious feature that I would pay you money for right-here-right-now is the ability to multitrack audio. Let me give you many dollars to route each voice to a separate Soundflower channel, so that I can mix them outside of Discord, and I will give you said monies. I'm pretty sure you're even sending unmixed streams down? But I can't get at them!

This probably involves not being Electron (from my own adventures in the area), so I don't hold out much hope, but it keeps Discord from replacing Extremely Expensive And Bad solutions like SkypeTX for me.

exikyut · on Oct 17, 2017

The first thing I was thinking is using PulseAudio somehow. It has some bad image issues but its swiss-knife-of-audio-routing chops are undeniably present. It's Linux only though, so probably wouldn't be useful here.

I'm trying to figure out what the actual context in question is, particularly in terms of technical connectivity. Is this being used for remote DJing? Or conferencing? Or an audio recording situation?

If you're prepared to throw money at the situation, it's possible this may be fixable with a simple bespoke solution. I say "possible" because, unfortunately, I just did some digging and found https://bugs.chromium.org/p/chromium/issues/detail?id=453876:

> Unfortunately we don't support multi-channel > 2 nor multiple devices at the moment.

> ...

> Are there any future plans to support these two features? Is this a w3c issue or a Chrome issue?

> ...

> I am quite skeptical about this; I was told this requires a huge change in our WebRTC-side infrastructure, but I am not sure what the current status is.

> The spec indicates getUserMedia can be configured with 'channel count', so I assume this is Chrome issue.

That immediately nukes WebRTC :(

Could make for a fun project. I'm very fascinated with audio handling myself and this sounds interesting, but I'm unsure I'd personally have the skills (or mental stamina/attention span :< ) to be sure I could follow through. I'm also only on a Linux box, which brings up the platform-native problem.

eropple · on Oct 19, 2017

Sorry, just saw this. I need to split audio to mix and level it for stuff like live-streamed podcasts. So, on top of that, I need to pull video.

Honestly, the best answer is probably to continue using multiple Skype instances. Which is gross. But, y'know.

exikyut · on Oct 20, 2017

It's fine - you actually saw it, which is cool :) some of my other past replies have gone completely unnoticed

I see. I get the impression this is collaborative podcasting with multiple people that have multiple microphones. (I can't figure out why else you'd need multichannel A+V transport.) FWIW, it does sound like Skype is probably your best bet for the time being (unfortunately). It's simple, it works for everyone, etc.

callalex · on Oct 17, 2017

Doesn’t Mumble have that feature?

eropple · on Oct 17, 2017

Mumble doesn't, AFAIK, also do video/screenshare. Think of it like a teleconference. (I could be wrong though, it's been a long time.)

fletchowns · on Oct 17, 2017

Occasionally we notice the robotic voices when service has really degraded. But when it's operating normally, it seems like mumble sounds better.

It's also annoying having to adjust each user's volume slider individually. I wish Discord would just do that automatically and normalize the levels of everybody.

gizmo385 · on Oct 16, 2017

One thing that I've noticed helps is swapping the region for the server. It doesn't result in any noticeable outage on your server and has resolved any voice quality issues that we've run into most of the time. Usually just swapping between US-Central and US-West in our case.

andyfleming · on Oct 16, 2017

I briefly thought about whether it would be a better option than Slack. However, Slack seems a little more polished in the areas that we actually use. The subtly better design in places like the core chat experience isn't worth giving up.

I have used Discord quite a bit for gaming, but it hasn't proved a better option than Slack to me (at least in a work context).

sbarre · on Oct 16, 2017

We've debated it internally, but have decided to stick with Slack (free version) for now.

eslachance · on Oct 17, 2017

I'll need to echo some of the other comments here - at work we're pretty much stuck with Skype for Business (aka "lync") because the voice system ties into it. When I suggested Discord, it was shrugged off because it's "Chat for Gamers" and clearly they're not going to budge from that niche anytime soon, from all the recent features I've seen.

synicalx · on Oct 16, 2017

Unofficially, for "out of band" discussions and also for on-call events where we need to collab on stuff.

We're stuck with Skype for official stuff and telephony though.

jakebasile · on Oct 17, 2017

We did, but due to some people having image issues about using a “gamer tool”, we were forced to switch back to the inferior Slack.

zanny · on Oct 17, 2017

[flagged]

eslachance · on Oct 17, 2017

That's nice misinformation, mate. Discord makes money from Nitro, and they started off with a fairly large investor cushion which grew since launch.

There is nothing in the ToS that states in any way, shape, or form, that they can sell their data. I challenge you to quote the exact place where it says that.

b1naryth1ef · on Oct 17, 2017

Yup this, our privacy policy plainly states that we're not in the business of making money from your data. We have various provisions which limit how and when we can share your data.

lightedman · on Oct 17, 2017

You don't look in the ToS. You look in the company articles of incorporation or bylaws which will explain things like how money is allowed to be made.

exikyut · on Oct 17, 2017

How would that work?

If a company drafts ToS and then plumb violates them, well, it'll be found out eventually and then you get things like the recent debacle with telco data being sold (https://news.ycombinator.com/item?id=15477286), and then the company(s) in question will either get completely shredded or their reputation will at least be seriously besmirched.

I can see the logic of what you're saying, but at the same time I can't see how it would hold up. If you can show me how it would work I'm listening though.

lightedman · on Oct 20, 2017

As soon as you start your own business, you'll know all of this. The short gist of it is when you incorporate, you have to publish rules that govern your entire operation. These are either your articles of incorporation or your bylaws, depending upon the type of entity you are.

If "Don't sell customer data, even surreptitiously" isn't in those bylaws or articles, then the company is free and clear to do that. ToS says how you can use THEIR service externally, not how they internally operate.

exikyut · on Oct 22, 2017

Oh. Thanks very very much for this clarification; most helpful TIL.

Apologies for not replying sooner; I didn't see this until now!

lwansbrough · on Oct 16, 2017

Funny definition of HA. :)