How Discord stores billions of messages (2017) (discord.com)
204 points by greymalik on Aug 26, 2022 | 73 comments



This is quite valuable advice: "The original version of Discord was built in just under two months in early 2015. Arguably, one of the best databases for iterating quickly is MongoDB. Everything on Discord was stored in a single MongoDB replica set and this was intentional, but we also planned everything for easy migration to a new database"

Also the article links to a Twitter blog post, which makes a similar point (it's from 2010): "We [Twitter] currently use MySQL to store most of our online data. In the beginning, the data was in one small database instance which in turn became one large database instance and eventually many large database clusters" [1]

[1] https://blog.twitter.com/engineering/en_us/a/2010/announcing...


The first version of reddit also used a single PostgreSQL instance apparently: https://github.com/reddit-archive/reddit1.0/blob/bb4fbdb5871...


Reddit is perhaps not the best company to look at for operational excellence. They seem to be barely keeping the ship afloat.


Modern Reddit runs like ass for me, and I have a 2021 4.2GHz 8-core Ryzen CPU. I can't imagine what it's like on older systems, when even my work machine's 9th-gen Intel struggled with the site.


It’s incredible how slow their website is. I understand (but am against) their motivation to push everyone to the app on mobile devices, but on desktop where am I going to go? My powerful computers both struggle to render the site.


I don't understand your issue. The website runs fine on our M1 Pro MacBooks we use for development so the problem must be on your end. /s


Runs like an old dog there too


old.reddit.com is still alive! RES still works as well.

You can live in a happy little bubble ignoring their current escapades


Seems like sound advice? Starting with DB clusters would be premature optimization, but designing your schema with the expectation that it will be on multiple instances later is sane. Kind of like building a monolith with the expectation that certain components will eventually be moved out into their own services.


Twitter was quite famous for their fail whale page which would come up often. Reddit also had a lot of trouble keeping up with user visits.

From a technical standpoint, they aren't great examples of scaling away from a simple MVP implementation.

They did manage to give users something so good that users were willing to put up with the outages.


They are good examples of not over-engineering at the beginning. Most early-Twitters don't become late-Twitter; they just die.


> Nothing was surprising, we got exactly what we expected.

Such a satisfying feeling in the engineering world.

> We noticed Cassandra was running 10 second “stop-the-world” GC constantly but we had no idea why.

This makes me very thankful for the work that the Go team has put into the go GC.

> In the scenario that a user edits a message at the same time as another user deletes the same message, we ended up with a row that was missing all the data except the primary key and the text since all Cassandra writes are upserts.

Does cassandra not offer a mechanism to do a conditional update? I'd expect to be able to submit a upsert that fails if the row isn't present, or has a `deleted = true` field, or something to that effect.


> This makes me very thankful for the work that the Go team has put into the go GC.

If you read that part of the piece again, the problem wasn't necessarily Java's GC implementation but Cassandra's storage architecture combined with a mass delete, which tombstoned millions of records that Cassandra then had to scan and skip over: "Cassandra had to effectively scan millions of message tombstones (generating garbage faster than the JVM could collect it)"

Even if you had an incremental GC, you'd be eating CPU and thrashing the cache like crazy doing that. And you may never catch up.

Actually, even if you weren't in a GC'd language, you'd likely still have scan or free operations.

Writing databases is hard. And actually the various JVM GCs are top notch and have had hundreds of thousands of engineering hours put into perfecting them for various workloads.

Still, these days I'm glad not to be near it and I work in Rust or C++.


I'm not discounting that the GC had an overwhelming amount of work to do.

My comment wasn't intended to convey that the go gc is somehow more efficient, but rather that I'm thankful for the tradeoffs they've made prioritizing latency and responsiveness, and minimizing the STW phase. I can see how I didn't give enough context for it to be interpreted correctly.

It's been my experience that with larger STW pauses, you wind up with other effects -- to the outside observer it's impossible to tell if the service is even available or not. You could argue that if you're thrashing memory that hard, no, it's not available. In general though, it makes for higher variance in latency distributions.

> Writing databases is hard. And actually the various JVM GCs are top notch and have had hundreds of thousands of engineering hours put into perfecting them for various workloads.

no doubt on both accounts. again, thankful for the design tradeoffs.

> Still, these days I'm glad not to be near it and I work in Rust or C++.

For anything that really needs to be consistent, or needs full control over execution, I don't blame you. Cassandra is a prime example of what happens when you don't have that full control.


I feel like GP is talking about GC at the database level, not the language level.

E.g. something like RocksDB, written in C++, still has a compaction phase.


I think they're actually talking about both at once.

Cassandra definitely has compaction (similar LSM storage as RocksDB), and it causes pauses.

It's been over 10 years since I touched Cassandra, but when I did there was often a royal dumpster fire when compaction & JVM GC ganged up to make a big mess.


"Conditional update" requires consensus to be fault tolerant, and regular Cassandra operations don't use consensus. Cassandra has a feature called light-weight transactions which can sort of get you there, and they use Paxos to do it. It unfortunately has a history of bugs that invalidate the guarantees you'd want from this feature, but more modern versions of Cassandra are improving in this regard.
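For reference, a conditional write via an LWT looks roughly like this in CQL (table and column names here are made up for illustration):

```sql
-- Only apply the edit if the row still exists; Cassandra runs a
-- Paxos round under the hood to make the check-and-set atomic.
UPDATE messages
SET content = 'edited text'
WHERE channel_id = 123 AND message_id = 456
IF EXISTS;
```

The trade-off is that this is much more expensive than a plain upsert, which is why it isn't the default write path.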


Yes. ScyllaDB also has LWTs based on Paxos. More here: https://www.scylladb.com/2020/07/15/getting-the-most-out-of-...

[Disclosure: I work for ScyllaDB]


> This makes me very thankful for the work that the Go team has put into the go GC.

Has the Go team hit on some GC algorithm that has escaped the notice of the entire compiler-creating world since 1995? Not to disparage Go at all, it's got brilliant people working on it, but you could boil the oceans at this point with the effort from the brilliant folks who have been working on JVM garbage collection for the last 20+ years.


The Go team has members with decades of experience writing garbage collection algorithms, including on the JVM, so it's not starting from scratch.

Go has value types and stack allocation that let it create _far_ less garbage than Java, and its heap objects violate the generational hypothesis, because short-lived objects tend to stay on the stack. This means the Go GC can comfortably operate with far less GC throughput than typical Java programs require.

https://go.dev/blog/ismmkeynote


Java has done escape analysis and scalar replacement since Java 6.


> Has the Go team hit on some GC algorithm that has escaped the notice of the entire compiler-creating world since 1995?

It's like the saying about recycling: "1. Reduce 2. Reuse 3. Recycle". Just like in real life, it's far better to never have to go all the way down to #3. If you never generate garbage in the first place, you won't have to waste resources cleaning it up. So it's less about how good the actual Go GC is and more that the language itself greatly discourages wasting memory via aggressive stack allocation, value types, and an allocation-conscious culture.


It actually has very poor throughput when compared to almost any JVM GC. This is mitigated somewhat by the commonality of value types.

The major advantage is mostly predictably short pauses. For web services it’s a reasonable trade off, but it can be a problem that the GC can’t be tuned for something else.


(2017)

As revealed in this blog post[0], they now see 4 billion messages per day.

0: https://news.ycombinator.com/item?id=32474093


And switched to ScyllaDB.


Around ~11k a second. Not small but not insane scale either.


4,000,000,000/(24 • 60 • 60) ~= 46,000

And their messages-per-second rate definitely scales with time of day, as illustrated by this image[0] from this blog post[1] (given their CPUs are only really handling messages).

0: https://assets-global.website-files.com/5f9072399b2640f14d6a...

1: https://discord.com/blog/why-discord-is-switching-from-go-to...
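As a quick sanity check on the arithmetic above:

```python
# Average message rate implied by 4 billion messages/day.
messages_per_day = 4_000_000_000
seconds_per_day = 24 * 60 * 60  # 86,400
avg_mps = messages_per_day / seconds_per_day
print(f"{avg_mps:,.0f} msg/s")  # ~46,296 on average; peaks are higher
```

That's the flat average; the time-of-day curve means peak load is well above it.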


as someone that's worked with a similar workload (edit: in terms of write volume, not persistent chat messages), it's an incredibly annoying volume.

Too high for simple solutions, not big enough for large scale solutions.


I haven't used Cassandra for about 3 years, and this is awakening memories. At a previous company I inherited a very badly assembled cluster that was used for a time series database. The guys who built it said (and no, I'm not kidding...) "we don't need to put a TTL on metrics because they're tiny and anyway we can just add more nodes and scale the cluster horizontally forever!". Well, forever was about 2 years, when the physical data center ran out of rack space, and teams abused the metrics system with TBs of bullshit data. That was when they handed the whole metrics system to yours truly. And I discovered two things:

1. You can't just bulk delete a year of old stale data without breaking it

2. "Whoops, did we really set the replication factor to 1?"

Fun.


What was the issue with the bulk deletes? Locking the CPU?


[sorry if you're reading in real-time, lots of edits]

It's mentioned briefly in the original article. Cassandra achieves extremely good write performance by being append-only, which means that its disk I/O patterns are sequential, with little random I/O. This gives great performance even on crappy spinning disks. However, with append-only storage, updating or deleting existing data is handled by writing new data or "tombstones".

When you perform a search, Cassandra still scans all matching data in the table, including that which was deleted. It then uses tombstones to "ignore" deleted data. In other words, lazy delete is great for write performance (Cassandra is optimized for write), but creates overhead during read.

So to avoid killing the cluster during a search, Cassandra imposes a tombstone limit. If a search encounters more tombstones than the limit, the search will be abandoned. Real world example: if you're using Cassandra as a time series database and you don't add a sensible TTL to your metrics, a bulk delete would cause searches to suddenly fail, and they wouldn't work again until after compaction is performed on the table, rewriting the whole table minus data whose tombstones are older than the gc grace period.

Metrics systems aren't just used for eyeballing dashboards during an incident; they're (usually) part of your alerting system, so this could lead to missed alerts or false positives.

This is why it's important to continually "age-out" data with a TTL, so that compactions are regularly removing tombstones older than the gc grace period.
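In CQL, a TTL can be attached per write, or set as a table default (schema and names here are hypothetical):

```sql
-- Expire each sample 30 days after it is written, so tombstones are
-- created and compacted away continuously instead of in one huge batch.
INSERT INTO metrics.samples (series_id, ts, value)
VALUES ('cpu.host42', '2022-08-26 00:00:00', 0.73)
USING TTL 2592000;  -- 30 days, in seconds

-- Or as a table-level default:
ALTER TABLE metrics.samples WITH default_time_to_live = 2592000;
```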

"Fucking gc grace period!" (me, on a few occasions)

Why is there a gc grace period in here, making trouble for us? Because Cassandra is a distributed, replicated system that handles partitions and failures, it knows that some nodes may be temporarily unreachable. So it holds on to tombstones long enough to ensure that every node gets them. If it didn't do this, an offline node might not get the tombstone before it's compacted out of existence on all the other nodes. So when it comes back online you get zombie deleted data, back from the dead! Not nice!

If you fuck this up and end up with a huge set of old data that you must delete, you are in for a treat. The easiest option is to manually schedule a bunch of operations to run during a quiet period. With metrics, write traffic volume is pretty steady but read traffic is user-driven. That could mean that quiet periods are nightly, on weekends, etc. The operations involve checking for downed nodes, modifying the gc grace period, deleting a bunch of data and running a compaction. You have to carefully monitor logs for errors, and keep an eye on tombstone limits. Depending on your infrastructure it might be easier to build a new cluster side-by-side, write all the data from the old cluster into the new one with proper TTLs and delete the old cluster. If you're on bare metal and are out of space, you might have to get creative. (To keep a production metrics system running for a few years I had to get extremely creative)
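The main knob being tuned is `gc_grace_seconds` (table name hypothetical; the default is ten days):

```sql
-- Temporarily shorten how long tombstones are retained
-- (default 864000 s = 10 days). Only safe when no node is down
-- or behind on repairs, or deleted data can come back from the dead.
ALTER TABLE metrics.samples WITH gc_grace_seconds = 3600;

-- Then force a major compaction from the shell, e.g.:
--   nodetool compact metrics samples
```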

Sorry for the huge thread and possibly wrong info, it's a while since I touched Cassandra, but I still have mental scars from the people who set it up wrong.


Really good brief explanation, thank you!

I can feel your exasperation from here.


I don't know databases and there's quite a number of them on the market, so posts like these are great.

Sidetrack though, does anyone have a list for pros and cons of each db, with a preference towards low latency? Also how does it compare with say Postgres?


That depends heavily on the shape of your data, what your workload looks like, what sort of consistency guarantees you need, etc. I recommend Designing Data Intensive Applications for getting a handle on this - it's the book I suggest to SDE IIs who are hungry for the jump to senior. Not a quick read, but well written, and there's not really a shortcut to deep understanding other than deep study.


> I suggest to SDE IIs who are hungry for the jump to senior

Any suggestions for seniors who are hungry for a staff title? Is there any value in specializing in tech after senior, or is it all 'soft skills' at that point?


It is not all any one thing. If you’re completely lacking in soft skills, you’ll probably have more trouble the higher up you go, and you may get stuck at some point. Hard technical skills become less important the higher you go, but even the CEO of a tech company probably still needs to have some basic understanding of technology. Being good at both things is obviously the best way forward.


I enjoyed this: https://staffeng.com/book

Not a staff though so can't attest to how true/accurate/effective it is.


thanks, will look that up


I did a talk about establishing criteria for differences between distributed SQL and NoSQL databases.

https://www.youtube.com/watch?v=nJvnq8ZD_7Q

There are a couple of different dimensions to consider. I didn't focus on this in the talk but firstly you should consider these elements...

• Speed (latency & throughputs)

• Scale (gigabytes? terabytes? petabytes? exabytes?)

• Workload types (OLAP, OLTP, HTAP, data warehouses/lakes/lakehouses)

• Consistency Models ("ACID vs. BASE", eventual vs. strong, etc.)

• Data Model - RDBMS, NoSQL (Key-Value, Wide Column, Document, Graph, Column Store, etc.)

• Query Language (SQL, streaming SQL [Flink], CQL [Cassandra, ScyllaDB, DataStax, CosmosDB, Keyspaces, Yugabyte, etc.], Gremlin/TinkerPop [JanusGraph, DSE Graph, OrientDB, etc.], and so on)

What I did focus on was more the "distributed system" attributes...

• Clustering & distribution (local clustering, cross-clustering, multidatacenter clustering)

• Node roles, high availability & failover strategies (primary/replica, peer-to-peer leaderless/active-active, load balancing)

• Data replication/sharding strategies (replication factors, consistency levels, manual vs. auto-sharding, topology awareness [rack and datacenter awareness])

I only focused on a handful. There are literally hundreds of databases tracked on DB-engines:

https://db-engines.com/en/ranking

[Disclosure, I work for ScyllaDB]


None of Bigtable, Spanner, or DynamoDB made your shortlist?


For the sake of that talk I focused only on open source you could self-host.

SQL • PostgreSQL • CockroachDB

NoSQL • MongoDB • Redis • ScyllaDB [Cassandra]

There are a lot of points in the DBaaS world to also consider, but there's also issues of visibility — can these issues be known, perceived or understood from a user's POV? Are they documented? Or are some of these considerations obscured and trade secreted? So I had to pare down.

I planned to do a whole different talk on DBaaS systems, and yes, I'd likely shortlist Bigtable, Spanner, DynamoDB, and a few others for that.

Only so much to do in an hour! :)


Unfortunately, they're not doing a good job at deleting them. If you press the "Delete Account" button, all it does is anonymize your profile, and leaves all of your messages intact. One of the reasons I avoid using Discord whenever possible.


Why is HN acceptable then?


HN is a public forum, not behind a pay or auth wall.


You need to login to your account, so how is that not an auth wall?


You don't need to be logged in to view other people's comments though


that's true


Just treat it the same way, assume everything you will say on discord will become public. Problem solved.


Agreed. People like to make excuses for the poor functionality of HN. We should be able to delete our comments.


Sorry, but you have the wrong idea. That would be fine for public servers, but not for private conversations. It's unacceptable.

edit:typo


I can't think of a good reason why the presence or effectiveness of a delete button should change what you write online, even in a so-called "private conversation", or whether you use a certain platform.

Delete is useful, but it shouldn't change what you're willing to send online because delete doesn't time travel and unsend the message.


I don't consider it "online" when it's a PM. A private message should be private. There are reasons to get rid of the whole thing sometimes.


If you are typing anything you desire to stay hidden forever into a completely unencrypted and openly data mined and brokered (read your privacy policy) cloud-based solution then I have a bridge to sell you...


Private conversations aren't really very private on Discord. Attachment URLs are globally visible.


If you individually delete messages, then yes, they are actually deleted.


The only effective way to delete messages individually is by using automation software, which can get your account banned or locked.


As a point of comparison, Slack uses MySQL (Vitess) – https://slack.engineering/scaling-datastores-at-slack-with-v...


I believe they use ScyllaDB exclusively now for storing messages


"build quickly to prove out a product feature, but always with a path to a more robust solution"

Yes


Discord doesn't respect privacy, you cannot just get rid of a whole conversation. Users are the product, and they make it so difficult to delete entire convos that it's so obvious it's just valuable to them.


You can't on HN either.


on HN the convos aren't private.


Discord is a closed-source service hosted by a 3rd party on the internet. It's by definition not private either.


Hacker News doesn't have "private messages", Discord does.


My point is neither has private messages. Discord has direct user-to-user messages. Those things are not the same.


Can say the same thing about IRC and logging…anything online is permanent


For public servers I understand, for private messages obviously not.

edit: better wording


I had a bigger issue with requiring a phone number than with the lack of ephemeral posting. The latter is something I'm used to, at least.


Wonder if Postgres would scale to such volume.


Maybe it's possible with Citus.

But I wouldn't try, it's not the right tool for the job.


How many are bots? AFAIK bot traffic is higher than human traffic there. Plus they added so many bot APIs that now bots are hacking and spamming users' accounts.


Bot traffic and human traffic aren't mutually exclusive, bots can be humans:

https://t2bot.io/discord/


Elsewhere in social media and gaming, many accounts may have both manual and automated activities associated with them.

For example, in a WoW chat, that might be your friend typing "lol," or it could be an add-on putting out an automated raid warning.

Similarly, your Twitter account could be a manual post, or a scheduled one from Tweetdeck.

Are you now a "bot account?"

Discord does permit bots in its system. But it tries to limit "self-bots" on the platform. It wants humans to be humans and bots to be bots. "Self-bots" in Discord could result in termination. Read more here:

https://support.discord.com/hc/en-us/articles/115002192352-A....



