I love write-ups like this because they are such a nice contrast to the too-common comments on Reddit and HN where people claim that they could rebuild FB or Uber as a side project. Something as superficially trivial as the r/Place requires a tremendous amount of effort to run smoothly and there are countless gotchas and issues that you'd never even begin to consider unless you really tried to implement it yourself. Big thanks to Reddit for the fun experiment and for sharing with us some of the engineering lessons behind it.
> the too-common comments on Reddit and HN where people claim that they could rebuild FB or Uber as a side project.
I do and don't agree with you. Whats really going on here is that development time scales linearly with the number of decisions you need to make. Decisions can take the form of a product questions - "what are we making?" and development questions - "how should we implement that?".
There are three reasons why people feel confident saying "I could copy that in a weekend":
- When looking at an existing thing (like r/place), most of the product decisions have already been made. You don't need to re-make them.
- If you have a lot of development experience in a given domain, you don't need to make as many decisions to come up with a good structure - your intuition (tuned by experience) will do the hard work for you.
- For most products, most of the actual work is in little features and polish thats 'below the water', and not at all obvious to an outsider. Check out this 'should we send a notification in slack' flowchart: https://twitter.com/mathowie/status/837807513982525440?lang=... . I'm sure Uber and Facebook have hundreds of little features like that. When people say "I could copy that in a weekend", they're usually only thinking about the core functionality. Not all the polish and testing you'd actually need to launch a real product.
With all that said, I bet I could implement r/place in less than a weekend with the requirements stated at the top of that blog post, so long as I don't need to think about mobile support and notifications. Thats possible not because I'm special, and not because I'm full of it but because most of the hard decisions have already been made. The product decisions are specified in the post (image size, performance requirements) and for the technical decisions I can rely on experience. (I'd do it on top of kafka. Use CQRS, then have caching nodes I could scale out and strong versions using the event stream numbers. Tie the versions into a rendered image URL and use nginx to cache ... Etc.)
But I couldn't implement r/place that quickly if reddit didn't already do all the work deciding on the scope of the problem, and what the end result should look like.
In a sense, I think programming is slow and unpredictable because of the learning we need to constantly do. Programming without needing to learn is just typing; and typing is fast. I rule out doing the mobile notifications not because I think it would be hard, but because I haven't done it before. So I'd need to learn the right way to implement it. - And hence, thats where the scope might blowout for me. Thats where I would have to make decisions, and deciding correctly takes time.
First off, it's a bit unfair to me that you were the one to make this comment, since your very extensive experience and deep knowledge of real-time systems makes you uniquely qualified to disprove my point!! hehe
But in all seriousness, I agree with most of what you said - I think I'm just more bearish on people's ability to infer those many decision points without being given the blueprint like we were in this article.
If you are a senior engineer at FB and you decide to make Twitter in your spare time, I buy that lots of experience and knowledge gained at your job can probably get you going fairly quickly. But I have never seen an example of engineers discussing sophisticated systems like these where crucial aspects of its success in terms of implementation didn't rely on some very specific knowledge/study of the particular problem being solved that could only be gleaned after trying things out or studying it very carefully. The representation of the pixels is a great example in this case -- they go into wonderful detail about why they decided to represent it the way they did, which in turn informs and impacts how the rest of the stack looks like.
I think at one point Firebase had it as one of their example apps something which very closely mirrored what they did with r/Place, so I agree that one could probably build some "roughly" like it somewhat quickly. I agree that in general knowledgeable individuals could probably grok things which would in some form resemble popular services we know today. The devil is in the "roughly," though. I think that often what makes them be THE giant services we know are things that let them have the scale which very few of us have ever needed or know how to deal with, or because they have combined tons of "below the water" features and polish like you mentioned. When really basically most web apps that we use are CRUD apps which we all know how to build, I think maybe we need to give more weight to these "below the water" features in terms of how much they actually contribute to the success of the applications we use.
> The representation of the pixels is a great example in this case
That's funny because I thought (assuming you're referring to the bit-packing of the pixels) that seemed to me one of the more obvious choices (to do something else would have been more remarkable to me).
Beyond that though, I have zero experience with real-time networked systems. Especially with a gazillion users and that everybody gets to see the same image, that seems hard.
The cleverest solution that I read in the article, that I really liked but probably would never have thought of myself (kinda wish I would, though) was the part where the app starts receiving and buffering tile-changes right away, and only then requests the full image state, which will always be somewhat stale because of its size, but it has a timestamp so they can use the buffered tile-changes to quickly bring its state up to time=now. Maybe this is a common technique in real-time networked systems, I don't know, but it was cool to learn about it.
Thank you for saying this. I was going to say the same thing. The difference between facebook and my side project is not the core functionality. Anyone competent can build something that seems to look and act like facebook in a weekend or two in the general sense of "look and feel." Which is also perhaps a dangerous way to go, lest you get sued.
It's getting the nitty-gritty details hammered out and then implemented that takes time and people. I'm currently trying to do something similar but directed in a very different way. Facebook feels like a CRUD app. But there's a lot of different moving parts that you have to deal with on top of the CRUD pieces.
And every single one of those things--while not necessarily difficult to implement individually--add up to tons of decision time.
When the expected outcome is already decided, the problem becomes much much easier to solve for a developer.
Assuming that you're dealing with some kind of a reasonable web stack, implementing any individual feature is often not that big a deal. Deciding to do it, and also perhaps making the choice to have a reasonable web stack to begin with, are potentially problematic.
And of course, building something that can massively scale is hard as well.
But from my experience with side projects and also my day job, the hard part is making decisions.
I wish I had enough spare cash to take you up on a bet that you could implement this in less than a weekend. Maybe HN can crowdfund paying you for two days at $1000/day to replicate, versus your time and a $3000 rebate if you can't.
My loved ones are out of town, so you're on! I'll do it for the fun of it and the experience. And because I still haven't built anything on top of kafka, but there's no time like now to fix that.
The challenge is: Meet the main requirements of r/place: 1000x1000 image, a web based editor, 333+ edits / second, with an architecture that can scale to 100k simultaneous users (although that part will be hard to actually test). I won't implement mobile support or notifications, and I won't implement any sort of user access control (thats out of scope). The challenge is to do it & get it hosted online before I go to bed on Saturday (so, I'm allowing myself some slop there). But its already 1:30pm on Friday, so I think that easily qualifies as "less than a weekend".
As a stretch I'm going to write all the actual code in nodejs while aiming for well into the thousands / tens of thousands of writes per second territory.
I'm willing to accept some fun stakes if anyone wants to propose them (ice bucket challenge level stuff). I'll live-tweet progress here - https://twitter.com/josephgentle
Complete and working. I haven't tested the load for real, but judging by the performance numbers it should be able to handle 100k connections fine with 2 beefy backend servers. (40% of one of my laptop's cores handles 15k connections with 400 edits per second. So about 8 cores should handle 100k users no sweat.) The bottleneck is fanout, so until kafka starts hitting subscribe limits it should scale linearly with frontend servers. (And kafka is a long way from hitting that limit because I've added write coalescing.)
I'll post a writeup about it all once I've eaten some real food and had a rest. But for now, enjoy!
I like what you are doing a lot. It doesn't matter that it isn't a one to one comparison, the point is you are taking on the challenge and enjoying it. Probably actually you are learning some stuff along the way as well.
It would be kind of neat if the place experiment could become something of a micro-benchmark for online customer facing distributed platforms.
Well, the server needs to message the client with updates somehow. Either the client keeps a connection open or it has to periodically poll the client. Keeping a connection open is faster and it uses less system resources.
Mobile support was a huge factor for /r/place. To reimplement the project while completely ignoring mobile is basically taking out maybe 30% of the effort required. It's a bit of a copout to claim "can implement in a weekend", while discounting one of the main reasons why it is nearly impossible to hammer out solo in a single weekend.
The other "big ticket item" is integrating such a system into an existing live stack. It's very convenient that you get to pick the tools you think are most efficient for the job from scratch; it's quite another to restrict yourself to an existing set of tech.
>> an architecture that can scale to 100k simultaneous users (although that part will be hard to actually test)
Also this. Reddit had a single opportunity to deploy live. You can't throw out many days of required dev/staging benchmarking for a system that must work on first deploy. You're not "done" after a weekend unless your first deploy both works 100% and scales as required.
tldr; Were I to do what you are doing, it would be out of interest and to put it live for others to use. What seems disingenuous is to do so with the primary goal of proving that "see, child's play - it's not difficult at all!". The goal should be to build something that works, not to rush the job as some way to show off and take something away from Reddit's efforts.
Absolutely! And I'm not going to claim otherwise. I said above I think I can only implement it in a weekend if I can work to my technical strengths. Doing otherwise would mean learning, which is super slow compared to typing. Just implementing to a closed spec cuts out maybe half of the work of something like this, because you don't need to figure out the requirements or waste time going down blind alleys.
From my post above:
> But I couldn't implement r/place that quickly if reddit didn't already do all the work deciding on the scope of the problem, and what the end result should look like.
Upvoted you. I'm a very cynical person, and thus I focused on the "weekend" aspect as being more of an attempt to refute the claim that Reddit had to put in quite a bit of effort to accomplish what they did, rather than you simply limiting how much time you're willing to sink into it.
If anything, this only increases my motivation to replicate the project myself, whether it's during a weekend or two full weeks. It's interesting enough and at the right level of complexity - kind of simple, but not too simple - to make it a fun side project.
Errr... oh, that thing is happening where its a public holiday so I'm a bit unmoored from reality. I'll get it done by Saturday night, not Sunday night.
Unless you stick with a single Kafka partition (which might not withstand the load) I don't think you can consume the events in exact chronological order.
A single kafka partition should be able to handle the load - it only needs to write 333 ops/second, and I could coalesce several edits together over half second windows if I wanted to make that more efficient. And I'll only have as many kafka consumers as I have frontend servers. So we're talking about 333 ops/second and like, 5 consumers of those operations. Somehow I think a single kafka partition should be able to manage the load :)
But there's no reason I couldn't use multiple partitions if I had to. To make it fully consistent I'd just need a consistent way to resolve conflicting concurrent writes on different partitions. Either adding a server timestamp to the messages, or using the partition number to pick a winner would work fine for this. It would just make the server logic a little more complicated.
Simulating hundreds to thousands of writes per second via post requests will be easy. Simulating 100k simultaneous readers of an eventstream will be much harder. I don't know of any tools which can load test that many readers. Do you have any ideas?
locust can spawn arbitrary scripts; you'd have to write your own eventstream reader, and simulating 100k of them will take a few machines concurrently load-testing, but i think this should be the easiest way to do it.
The core functionality that Facebook relies on exists in most open source web forums. A good developer could have easily built it on top of PHPBB back in the day - and probably had a more maintainable codebase.
Yes - today's Facebook could not be built by a dev in a weekend, but Facebook 1.0? Not so hard.
I have this hunch that Reddit engineers know what they're up against when they launch something like this. It can get out of control extremely fast when something catches on.
Why do you think it's over engineered. The scaling issues made sense. Cdn was a great move to limit throughput of requests. Redis made sense, javascript canvas and typedarrays made sense. How else would you have done it?
For the core application of writing pixels to the board, I would of used a more efficient monolithic design with less points of failure. One big server could handle 333 pixels/s(their expected volume). Not sure that using cassandra and redis is really necessary or makes things easier.
The secondary goal of pushing pixel updates to lots of users would probably want separate hardware. I would probably fake it so its not really realtime. If you send the web client batches of updates to replay on the board it looks realtime and doesn't require persistent connections.
What a fantastic writeup. I had some vague ideas regarding the challenges involved to build an application of such scale, but the article really makes it clear for everyone the amount of decision points encountered as well as why certain solutions were selected.
I also like the way the article is broken down into the backend, API, frontend and mobile. This isolated approach really highlights the different struggles each aspects of the product has, while dealing with what is essentially a shared concern: performance.
What I also found interesting is the fact that they were able to come up with a pretty accurate guess in terms of the expected traffic.
> "We experienced a maximum tile placement rate of almost 200/s. This was below our calculated maximum rate of 333/s (average of 100,000 users placing a tile every 5 minutes)."
Their guess ended up being a good amount above the actual maximum usage, but it was probably padded against the worst case scenario. The company that I work for consistently fails to come up with accurate guesses even with our very rigid user base, so it's pretty impressive that Reddit could accommodate the unpredictable user base that is the entire Reddit community.
I really like the "Big Data in Real-Time at Twitter" slides at https://www.slideshare.net/nkallen/q-con-3770885. They explain their original implementation and why it didn't scale, possible new implementations and their current one (it's a bit old though). It's clear and easy to understand.
Documentation always takes me as long or longer than designing, coding.
Most recently, when I released a novel layout manager, the docs, examples, screenshots, etc took roughly twice as much effort as everything else combined.
> We used our websocket service to publish updates to all the clients.
I used /r/place from a few different browsers with a few different accounts, and they all seemed to have slightly different view of the same pixels. Was I the only one who experienced this problem?
When /r/place experiment was still going, I assumed that they grouped updates in some sort of batches, but now it seems like they intended all users to receive all updates more or less immediately.
Yeah, we went into it a bit in the "What We Learned" section, but that was most likely during the time we were having issues with RabbitMQ. I believe it was mostly fixed later on, but either way, we found a new pain point in our system we can now work on.
Surprised you're using RabbitMQ. It's one of those things which work great until they don't (clustering is particularly bad), and then you have almost zero insight into the issue, and have to resort to the Pivotal mailing list.
Have you looked at NATS at all? We're using it as a message bus for one app and it's been fantastic. It is, however, an in-memory queue, and the current version cannot replace Rabbit for queues that require durability.
i've been using rabbitmq heavily (as in, the whole infrastructure is based on two rabbitmq servers) for a long time and i've never seen it fail.
tbh, i never used clustering (because it's one of the shittiest clustering implementations i've ever seen) but we do use two servers (publishers connect to one randomly and consumers connect to both) and it seems to handle millions of messages without any issues.
of all servers i've ever used, rabbitmq is by far the most stable (together with ejabberd).
RabbitMQ is decent if you don't use clustering (which, I agree, is shitty). I have some quibbles with the non-clustered parts, but nothing big.
Right now, the main annoyance is that it's impossible, as far as I understand, to limit its memory usage. You can set a "VM high watermark" and some other things, but beyond that, it will — much like, say, Elasticsearch — use a large amount of mysterious memory that you have no control over. You can't just say "use 1GB and nothing more", which is problematic on Kubernetes where you want to pack things a bit tightly. This happens even if all the queues are marked as durable.
yeah we have dedicated machines to rabbitmq because it's basically memory hungry. but i like it that way because it's only going to crash if the machine crashes.
Note that NATS is currently pub/sub, which is a "if a tree falls in the forest" situation. Messages don't go anywhere if nobody is subscribing.
So it's awesome for realtime firehose-type use cases where a websocket client connects, receives messages (every client gets all the messages, although NATS also supports load-balanced fanout) for a while, then eventually disconnects.
NATS is ridiculously fast [1], too.
There's an add-on currently in beta, NATS Streaming [1], which [2] has durability, acking/redelivery and replay, so covers most of what you get from both RabbitMQ and Kafka. It looks very promising.
I experienced this as well. I have a different account logged in on mobile than what is logged in on my desktop. I wouldn't say things were drastically different, but when there was a location with a ton of activity (like OSU or the American flag towards the end), I saw different views between them.
I thought it was interesting that one of their requirements was to provide an API that was easy to use for both bots and visualization tools. I remember reading some speculation when this was running that r/place was intentionally easy to interface with bots, while there were also complaints that the whole thing had been taken over by bots near the end.
Without bots, I doubt that /r/place would have been very interesting. It's a nice thought that a million random strangers can be cohesive without automation, but for some reason I don't find that to be particularly realistic..
As a concrete example, as far as I can tell the entire Puella Magi Madoka Magica section, starting from Homura Did Nothing Wrong next to Darth Plagueis The Wise, was hand-crafted and hand-maintained. On their discord they were actively discouraging community members that wanted to use bots.
I remember taking part in a big drawing canvas exactly like this about twelve years ago between several Art/Photoshop communities, Worth1000.com, SomethingAwful, Fark.com etc. There wasn't bots at the time but it was still very socially interesting.
A long time ago, I built http://www.ipaidthemost.com/, which is kinda related, at least to TMDH anyhow. Far, far less collaborative than /r/place, but similar in terms of staking out ownership.
Actually this idea has been around for years, and sadly isn't new at all. I just checked and there is one that's been around since at least 2006, http://da-archive.com/index.php?showtopic=42405
Right, true. Collaborative drawing was basically the "hello world" of real-time platforms back in the day (Firebase, Parse etc). But I think most of those were ephemeral canvases.
Yeah, collaborative drawing is kind of old hat, but when you can use the context of the modern, social web to provide some new modes of interaction around it, it can be interesting again. Same applies to more mundane things like text.
Not quite the same, but my startup, Formgraph[0], also does public, real-time collaborative drawing. I even did a similar write-up about the stack behind it the other day[1]. Looks like they're relying on Cassandra and Redis. I went with RethinkDB. Should probably do a write-up about the front end in the future.
Some redditors have created /r/place derivatives already. I'm not aware of one prior to /r/place but it seems impossible that it hasn't been done before
There were definitely a few that popped up after /r/place closed down. (pxls.space being the most popular) They were nowhere near as successful as /r/place though.
That was not realtime nor collaborative on a single pane. It was a "remix" platform where one user created the initial drawing, then people used it as a base and drew around/over it to change it - in a new picture.
Why use Redis and multiple machines instead of keeping it in RAM on a single machine? I'm not claiming the Reddit people did anything wrong; they have a lot more experience than me here obviously. I'm just trying to figure out why they couldn't do something simpler. 333 updates/sec to a 500kB packed array, coupled with cooldown logic, should have a negligible performance cost and can easily be done on a single thread. That thread could interact via atomic channels with the CDN (make a copy of the array 33 times a second and send it away, no problem) and websockets (send updates out via a broadcast channel, other cores re-broadcast to the 100K websockets). Again, I'm not saying this is actually a better idea, this is just what I would do naively and I'm curious where it would fall apart.
> 500kB packed array, coupled with cooldown logic, should have a negligible performance cost and can easily be done on a single thread. That thread could interact via atomic channels with the CDN
That's not simpler.
We used tools that we're already using heavily in production and are comfortable with.
With respect to your experience in the matter, I strongly disagree. What I described is complicated to say, easy to implement. What the OP describes ("use redis") is easy to say, complicated to implement. Not just in terms of human work time (setting up the redis machine and instance, connecting everything together), but also in terms of number of moving parts (more machines, more programs, etc.).
> We used tools that we're already using heavily in production and are comfortable with.
That's entirely fair, and what I figured was the most likely explanation.
I agree that machines go down, but there are sane (and safe!) ways to build this sort of thing without adding in cassandra and Redis. Additionally, the max placement rate of 333/s is reaaaaally slow! Maybe that's due to the websocket frontends, not the DB, but, that doesn't mean that's the most obvious way to build it.
The crux of the problem is that they need to mutate a relatively tiny amount of memory and have a rolling log of events for which only the last 5 minutes needs fast access. Also, if you can put all your state on one machine its far less likely that the one machine will die, than it is that at least one will die in a cluster of machines. Given the nature of the problem keeping all state on one machine seems pretty rational to me, so long as you have the ability to switch to a hot spare within a few minutes or so.
If I were to architect this for speed I would have two tiers: a websocket tier, and secondly a 'database' tier. The database would be a custom program that would:
0. Provide a simple Websocket API that would receive a write request and return either success if the user's write timer allowed it to write or failure if it didn't. This would also broadcast the state + deltas.
1. Keep the image in memory as a bitmap
2. Use rocksdb for tracking last user writes to enforce the 5m constraint. You could use an in memory map, but the nice thing about rocksdb is that it shouldn't blow up your heap.
3. Periodically flush the bitmap out to disk to timestamped files for snapshots
4. Keep the hashmap size small by evicting any keys past their time limit
5. Write rotating log files rotated every 5m or so to record the history of events for DR and also later analysis
Backing this sort of thing up is very simple. You just replicate the files using rsync or something like it. You may have some corruption on files that are partially written, but since we're opening and closing new files often you can choose how much data-loss you want to tolerate.
Restoration is as simple as re-reading the bitmap and reading the log files in reverse up to 5 minutes ago to see who still isn't allowed to write yet (thus reconstituting the hashmap). Let's remember, redis replication is async, so this has the same tradeoffs.
> The database would be a custom program that would:
Creating a "custom database program" is not a small task.
We like to use boring technologies that we know work well. We were already using Cassandra, had some experience with Redis, and had a lot of confidence in our CDN.
Well, in your article you mentioned that you tried to use Cassandra for one task and had to jettison it because of unexpected performance problems. You had to re-approach the problem with a whole other DB. I would say contradicts the point you're making.
I'm not arguing that most problems need a custom database, only a minority do. I'd say that this problem is borderline on which direction to go.
Databases are very leaky abstractions as you all discovered. The nice thing about custom code is that you don't have leaky abstractions. The bad thing about custom code is that you have a large new untested surface area.
In the case of your application the requirements are so minimal, a bitfield plus a log, I'd say its a wash.
Programmers today forget that things like flat files exist and are useful. It's a shame, because you wind up with situations where people just assume they need a giant distributed datastore for everything.
What you're doing in that case is trading architectural complexity for code complexity. Now, if its the case that all data in your org goes in one data store to keep things consistent, great, that makes sense. But for a one off app I just don't buy it.
And there's only half a meg of data! Serving the readers is a much more interesting problem than managing the writes, and tbh I'd just keep a rolling "pixels that have changed in the last 10 seconds" diff going, pushed out every 100ms, compressed and cached, and have clients poll for it. Easy peasy, websockets just complicate life.
> Users can place one tile every 5 minutes, so we must support an average update rate of 100,000 tiles per 5 minutes (333 updates/s).
It only takes a couple of outliers to bring everything down. I'm not exactly well-versed in defining specs for large scale backend apps (not a back-end engineer) but it seems to me that preparing for the average would not be a wise decision?
For example, designing with an average of a million requests per day in mind would probably fail, since you get most of that traffic during daytime and far more less at the nightly hours.
>We should support at least 100,000 simultaneous users.
This line makes me think that this is what they expected the peak (or near peak) to be.
>Users can place one tile every 5 minutes, so we must support an average update rate of 100,000 tiles per 5 minutes (333 updates/s).
So assuming that they mean that 100k is the peak and that clients are limited to 1 update per 5 minutes, they can expect 333 updates per second on average. The "average" is taken over this 5 minute period. This average represents the number of queries per second they will get if everyone's 5 minute cooldown is spread out evenly over each 5 minute period.
It is possible, for example, for half of the peak population's cooldown to expire at 1300 and the other half to expire at 1305. In this case the average updates/s over the 5 minute period from 1300-1305 would still be 333 updates/s even though there were really 2 bursts of 50k a second at 1300 and 1305. It's far more likely that cooldowns are not excessively stacked in this way, so you prepare for the average and hope for the best.
They prepared for the worst (peak 100k users). The "hope" that those 100k would be spread out was based on the statistical likelihood that these 100k wouldn't line up too much over a 5 minute period.
I didn't follow /r/place that much, but I haven't read any complaints about latency or failures so it looks like they did just fine.
They say elsewhere that they were prepared for 100k updates all at once. Which is the worst.
Edit: you're an sre, you probably have more experience planning these things than I ever will. I just can't help but think about what would happen if some trolls realized they could synchronize their updates and bring down the service. (Although based on the infrastructure it doesn't seem possible to cause lasting damage.)
Yeah, I understand what you're saying. It is conceivable that a bunch of people would coordinate their updates. To prevent new-comers from spoiling the fun, they only allowed users with accounts created before /r/place was launched to send updates.
> It only takes a couple of outliers to bring everything down. I'm not exactly well-versed in defining specs for large scale backend apps (not a back-end engineer) but it seems to me that preparing for the average would not be a wise decision?
It's not an average: Reddit controlled the 5 minute user cool-down period, a.k.a. request throttling. 333 updates/s was the capacity. As alluded to in the article, the 5 minutes was dynamically configurable: if more than 100,000 had users showed up, they would have increased the cool-down period to a value that would yield 333 updates per second at most.
With 100K users and a minimum cooldown time of 5 minutes, there's 100K tiles per 5 minutes would be an upper limit. Only bots would hit that 5 minute cooldown every single time, and bots make up a very small minority of the users. They even point out later in the article that they only peaked at 200 updates per second.
> For example, designing with an average of a million requests per day in mind would probably fail, since you get most of that traffic during daytime and far more less at the nightly hours.
That depends. If you're talking about a public website, then yes. But this is just a single API endpoint with a very fixed time-based rate limit.
I can shed that 333 [something/s], for any uncomplicated something, is so little that a single-core 10YO laptop running a single-process non-blocking webserver (a-la node) could probably handle it, and likely x10 it.
Sure in a perfect scenario with a well behaving load test client. In production you often encounter scenarios which place unexpected load on your servers.
Ah yes, the self-inflicted unintentional DDoS attack that doesn't appear when you do a semi-idealized load test. I may or may not have been responsible for one of those once or twice in my life.
Is that hyperbole or are you really experiencing that much downtime of Reddit? I see it occasionally but it's never down for long, the odd "servers are busy" message usually disappears after a single refresh.
Pages take a long time to generate all day, but during peak hours they take a minimum of 4 seconds each (depends on what page you're loading, if it's got lots of comments, etc), and many times they simply timeout. The engineers at reddit have been unable thus far to fix it.
I'm not the same guy you were just talking to but in my experience on mobile Safari, I get "something went wrong, visit the homepage" when I navigate around reddit far too often. What's funny is it actually tends to resolve itself if I wait a second or two, but yeah, reddit engineering and infrastructure is not well equipped to handle the amount of traffic they receive. It's gotten better, but it's still not what you'd expect from the internet's front page.
The main website is OK as is i.reddit.com but their mobile client is hilariously bad in a "I can't believe they thought this was ready" way, it takes an age to load, it stalls out all the time, the tap targets are too small, it has the classic "oh you hit this when you meant that and then clicked that, lets spin the wheel on where you really end up" problem that slow mobile apps have.
I know they deal with insane scale yadayada but it's simply not ready.
That and the horrific dark pattern on the "We want you to have the best, massive red/orange button marked continue that takes you to the app store and the tiny weeny little "continue to mobile site" underneath".
I don't want your damn app, stop asking me.
If I was cynical I'd think they didn't care about the mobile site been awful as it drives people to the app.
I can't agree with you. The sheer fact that I know well what the reddit error page is refutes this. In fact it's one of the few sites I frequent that I know even have an error page.
It probably is regional, yes, because I simply can't believe every time I write a comment like this here or somewhere else I'm told reddit works fine. It gets on my nerves every single time I visit at night. (Western Europe)
I almost never have a problem on desktop, but using iOS safari (and the mobile reddit site), I often see the error page. I think it's a view/rendering issue, and ends up being displayed rather than a loading modal, but I could be wrong.
maybe three years ago it did, but reddit has gotten drastically more stable since then. It still has the occasional downtime, but now it's more like every couple months than every couple days.
So I got hit by an unfortunate bug on the first day of /r/place.
I was trying to draw something, one pixel at a time, and all of a sudden, after a bunch of pixels, it stopped rate-limiting me! I could place as many as I wanted! So I just figured that they periodically gave people short bursts where they can do anything. This was backed up by my boss, who was also playing with /r/place, saying that the same thing happened to him not long before that (yes, my whole team at work was preoccupied with /r/place that Friday). So I quickly rushed to finish my drawing.
And then I reloaded my browser... and it wasn't there. Turns out that what I thought was a short burst of no rate limiting was just my client totally desyncing from Reddit's servers. Nothing was submitted at all.
Not too long after that, another guy on my team got hit by the same bug. But I told him what happened with me, so he didn't get his hopes up.
It happened to me as well. I did verify that my changes actually made change (from the same IP, but in incognito mode). Didn't bother to check if the changes stayed.
This is fantastic. I learned a lot, and it seems like they nailed everything.
I really enjoyed the part about TypedArray and ArrayBuffer. And this might be a common thing to do, but I've never thought about using a CDN with an expiry time of 1 second, just to buffer lots of requests while still being close to real-time. That's brilliant.
Given the scale described, it sounds like they could have had a single machine that held the data in memory and periodically flushed to disk/DB to support failing over to a standby.
Did you even bother reading the first few paragraphs? They talked about their usage of Redis for this. Next time please read the article before replying.
I would be slightly more careful and just use a cluster of servers with a simple consensus algorithm (like raft). A simple C++ server with a raft library plus uWebSockets should be able to handle a lot of load.
Our initial approach was to store the full board in a single row in Cassandra and each request for the full board would read that entire row.
This is the epitome of an anti-pattern .I sincerely hope that this approach was floated by somebody who had never used Cassandra before.
Even if individual requests were reasonably fast, you are sticking all of your data in a single partition, creating the hottest of hot spots and failing to leverage the scale out nature of your database.
This entire project is just an elaborate hack day project. There's no reason to fault them for trying new and interesting hacks to get it off the ground. They realized it wasn't the right method and moved on. End of story.
r/Place is really awesome. This is how you grow the community. The 2D and 3D timelapses are super cool to watch, as well. Glad Reddit decided to make this a full-time thing.
This is amazing and I got so many ideas on how to tackle the scaling issues I have with my own multiplayer drawing website. In the aftermath of r/Place I went into some of the factions' Discord servers and posted my site, getting 50-100 concurrent users which caused a meltdown on my server. It was a good stress test but also a wake-up call.
That data was recorded by the community. It took everyone a while to figure out how to grab a snapshot of the canvas though, so a few hours of data near the start are missing.
Weird question - why does that bash script use absolute paths for standard tools like awk and grep (/usr/bin/awk instead of just awk)? Is this some best practice I know nothing about?
Anyone with any insight into how much something like this 'cost' Reddit, resource wise. Is the main outlay in time and the server costs already covered by their infrastructure or does the high traffic add enough to make a difference?
This year? I'm thinking more like this decade. It's gotta be up in the top 10 of ever. On so many levels, /r/place was fascinating; and I didn't even come across it until after it had finished!
It took a while for members of the community to realize what was happening and start recording snapshots of the canvas, so there are a few time periods early on that got skipped
This was such an amazing demonstration of human collective collaboration. It sort of makes me fee like humans could do anything, even tho the result is sort of in some sense trivial. As well as simply being enjoyed, this could be studied in so many ways. Competition of memes and cultural representations. 'Evolutionary' convergence upon some optimum. Mainstream vs Fringe. Accepted vs Taboo concepts. Implicit spontaneous emergence of behaviour norms for participants: self regulating systems. I also like this timelapse which contains an overview, and then a close up of each of 12 sections ( 333 x 250 ) - https://www.youtube.com/watch?v=RCAsY8kjE3w
> We actually had a race condition here that allowed users to place multiple tiles at once. There was no locking around the steps 1-3 so simultaneous tile draw attempts could all pass the check at step 1 and then draw multiple tiles at step 2.
This is why you use a proper database.
I'd probably add a Postgres table to record all user activity, and use that to lock out users for 5 minutes as an initial filter. Have triggers on updates to then feed the rest of the application.
So in that case, each pixel would be stored as a separate row in a relational database? And to query the whole canvas you'd query a million rows on every read?
I lean towards just using the ratelimiting stuff we already have in place (via memcached, which we talked about in a previous post). We just overlooked it.
I'd most likely have two tables - one for user activity and one for each pixel (1 million rows only in that table). Selecting a million rows from that pixel table might be 200ms or whatever. I'd still have Redis cache, though, since you're getting 100ms.
Consider exactly what you are proposing. One table to store the entire history (one billion or more rows). A second denormalized table, whether updated at the application layer or via triggers, to store the most recent update to each of the one million cells (1000x1000 pixel grid = one million data points).
The simple fact of introducing a one-million-row read for the latest data of each "pixel cell" is fairly insane. You must have a cache for such data. "I'd still have have Redis cache, though" is not even debatable. It doesn't have to be Redis, but is definitely has to be a cache of one kind or another.
So, I just did a SELECT * from a table with 1 million single-byte character rows, and it ran in 90.51ms:
place=> explain analyze select * from board_bitmap ;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Seq Scan on board_bitmap (cost=0.00..14425.00 rows=1000000 width=6) (actual time=0.009..57.295 rows=1000000 loops=1)
Planning time: 0.160 ms
Execution time: 90.510 ms
(3 rows)
And, with triggers from an activity table, the entire write operation can be atomized so there aren't any race conditions.
I don't think you understand how fast Postgres is on modern hardware. What took a large cluster 5 years ago can be done on a single system with a fast NVMe drive today. We really might not even need Redis in this situation.
And, yes, I have to deal with viral content, so this is right up my alley.
At reddit it's much easier for us to stand up a new Cassandra column family than a new postgres table (not saying this is how it should be, but just how it is).
All we needed to do here was add some simple locking and we would have been fine.
Your parent commenter seems to have no idea as to the true scale you planned for. Most of the criticism I've read here on HN and on Reddit threads regarding your implementation seems to have come from people who have never had to code something that has real-world scaling requirements. This wasn't some pet project initially launched to 100 concurrent users, with the ability to slowly and incrementally scale to millions of users over a period of weeks or months. You had one shot to get it right. A majority of those criticizing would have crashed their entire production stack upon deploying. Hundreds, possibly thousands, of queries per second returning one million rows each? Not going to happen, no matter which database backend you choose. The foresight you had to get it right the first time was well played on your part.
Ideally, you would have also used redis to limit the per-user activity without having to hit Cassandra. Also not sure why you hit Cassandra instead of redis for the single-pixel fetch endpoint (redis GETBIT operation rather than a database hit); if you already conceded to not-quite-atomic operations across the entire map, a GETBIT would have rarely returned a stale data point. But these are minor nice-to-have criticisms that would have pushed the scaling capabilities even further beyond your expected requirements. All in all, again I highly commend your results. You had one minor snafu, and managed to overcome it. Well done!
Aside: my brain is spinning as to how I would provide a 100% guaranteed atomic version of /r/place - without any point of failure such as a redis server not failing/restarting, or a single-server in-memory nodejs data structure. Really tough to do so without any point of failure or concession to atomicity. :)
Second aside: more than anything, I am surprised you have a CDN that allows 1-second expiries. While perfect for this kind of project, too many CDNs find a 1-second expiry as a risk to permit, as they tend to expect too much abuse/churn. ie: How is a CDN supposed to trust you enough to use a 1-second expiry for reasonably high traffic, rather than cycling so much caching effort for something that could have used a 5 minute expiration? I can't imagine being the developer of a CDN that trusts its users to use a 1-second expiry that wastes an insane number of CPU cycles for an origin that is not legitimately sustainable.
tldr (still long, but on point): You guys did an amazing job for something that lasted, what was it, 3 days? Great job! Many of your critical audience members would not have managed any better, let alone being viable and functional. I would submit my résumé to work for you, but I fear my personality is far too... um... abrasive... to get along with the organisation as a whole. In any case, your team as a cohesive unit - design, backend, and frontend (especially the mobile support) - did an incredible job. +1 to the Reddit team here, you should be immensely proud of yourselves for pulling this off.
We don't have a normal CDN, we have Fastly. They are really incredible at what they do, and this would not have been possible with our previous CDN partners.
Easily doable. On a 10$/month server I frequently run queries doing text operations over 120 million rows for fulltext search of an IRC client backlog.
In 64ms.
Without caching.
Using PHP.
It's definitely doable, but you'll need to heavily fine-tune your queries. My first one was at over 2 hours for the same.
> It's definitely doable, but you'll need to heavily fine-tune your queries
Misrepresentation; then it's not actually over 120 million rows. You're basically encoding which subset to actually search in the query, rather than building a proper overall schema that trivializes queries.
With all due disrespect, you're wrong. Go ahead and implement your solution, and you will find it falls apart. So tired of people pretending to know better, without any data or real details to back it up. A "proper database" would not scale, regardless of whether it is Cassandra or Postgres.
You're completely ignoring, or completely oblivious of the fact, that the entire 1000x1000 grid must be provided to every connected client. You're not going to read out one million aggregated rows by most recent timestamp per cell, from a billion rows of history, in a scalable amount of time.
Please post your GitHub link that proves your solution as superior, or even viable. Make sure it includes database triggers, for which you don't explain how they would help scale the app whatsoever. Are you going to have a denormalized table containing each of the one million cells' most recent rows? All you are doing is eliminating a GROUP BY on the indexed cell+timestamp columns. It's still a million rows returned per query. Please explain how that scales. Eagerly awaiting your proven solution that defies common sense scaling logic.
Or just use redis for everything :) One instance for the bitfield, one for atomic locks (done with a lua script) and one for tile data (with a few slaves for reads). Simple and independently scaleable.
Database choice aside, I'm shocked that this wasn't noticed when designing the app... It's clearly not a design decision because they refer to it as an error later in the paragraph.
It gets weird when the cursor leaves the box while dragging. Now, when you go back inside, you're still in drag mode since the box did not get the "mouse up" event and you end up selecting and dragging random text.
In the end this does not seem to have mattered. Reddit's hardcore "contributors" are the kind of people who enjoy a challenge, even when it's stupid. I think it even turned into some kind of pride for some of them, being able to "master" an idiotically programmed system.
Myself, I just get so frustrated about the idiocy.
This is awesome but man, reading the canvas portion was a bit distressing. I wonder why they didn't use a game engine to do this? All the work they did has been implemented already in several JS game engines, such as the one I help maintain (it's free and OSS), https://excaliburjs.com. We support all the features they needed including mobile & touch support. They could have also used Phaser (http://phaser.io) too I bet... that has WebGL support for even faster rendering on supported devices.
Hi, I wrote the majority of that part of the project (canvas stuff) & that section of the article – the simple answer is that I have a lot of experience working with the canvas API directly, but little to no experience using any of the popular JS game engines out there (I played around with Phaser years ago, but not very much). I don't think it would've saved me any time to be honest.
...it's not like you have write assembly to get it done, the native canvas API is fairly straightforward. A game engine is a bit of an overkill if all you want to do is place pixels on a canvas.
But they wanted a lot more than that. Engines like Phaser work hard to take care of browser quirks for things like PointerEvents vs. TouchEvents vs. MouseEvents or supporting mobile devices. Sure, it seems simple at first until you run into those kinds of problems and reinvent the wheel... learning an engine isn't terribly complicated but I understand the sentiment for a one-time project. It just seems like they did so much other planning but didn't want to plan the UI implementation to the same degree?
Like you said, it was a one-time project that wasn't incredibly complex and just needed a one time deployment. And they had UI guys who generally knew what they wanted to and how to do it. I think it would probably have taken them more time to research suitable engines they could rely on than to build the functionality themselves (or use whatever libraries they were already very familiar with.) All engines/libraries end up having quirks that you really only learn through experience.
Also they were kind of time constrained. WebGL would have been definitely more performant but it has all sorts of hardware issues at times in different devices. I've done plenty of WebGL over the years and seen random GPU crashes where you have to restart the entire machine to unlock your self out.
The canvas trick with typed arrays is brilliant. Using requestAnimationFrame is what any front end dev who knows perf would do. Its like the front end version of cdn with 1 second time out trick.
Also using a layer of library whose code you don't understand in a perf critical application is quite risky. Its better to stick closer to the native browser APIs which you are familiar with. We once had to throw away a library and rewrite code from scratch because their assumptions failed when pushed to the limits. The rewritten code was 100x smaller in size and 10x more performant.
I would have loved to work on something like this but it sucks reddit doesn't have any presence in Seattle.
What is the learning curve on a game engine for not game engine developers?
They used tools they knew and knew how to scale. Almost always the tool you know is better than the perfect tool. They know redis and websockets, and they made it work. Beats using some engine know one on staff has ever touched?
I guess it depends on the engine, or maybe that was a rhetorical question. It doesn't take a month, or even a week, though--maybe to master them but not to learn the API. We've worked hard on our little project to design our engine for non-game developers or veterans, it's only a few lines of code to move a square, or draw an image, for example. But I understand the sentiment, especially for a one-time project.
I don't know, they had several well defined constraints and you don't know what a library is doing under the covers without spending some time with it. A proof of concept engine exploration is different than an app with 100,000 simultaneous users. I wonder how many business days they spent on this thing.