Hello all hackers from HackerNews. We noticed a new MMO released by n01se and xkcd yesterday (September 26th, 2012), with multiple users flying around as balloon figures. If you get stuck, you can click your balloon guy and turn into a ghost to move seamlessly through the landscape, unhindered by mortal barriers like trees and hills. There was a problem, however, with scaling users on the system: max concurrency was only 20 users at a time, leaving many to wonder where the MMO part of the MMO was. We ripped out the non-scaling Node.JS code. ENJOY.
I don't think it's fair to imply that Node.js is to blame for the scaling issues of this particular project. Dead reckoning, variable polling and other tricks could have been used with a Node.js server as well to make it look smooth and play decently; they just weren't there in this implementation.
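For instance, client-side dead reckoning can be as simple as extrapolating from the last position and velocity each peer reported. Here's a rough sketch; the avatar/message field names are made up for illustration, not taken from this project's code:

    // Rough dead-reckoning sketch: keep rendering a remote avatar between
    // sparse network updates by extrapolating its last reported vector.
    // Field names (x, y, vx, vy, lastUpdate) are illustrative only.
    function onNetworkUpdate(avatar, msg) {
      avatar.x = msg.x;                 // last authoritative position
      avatar.y = msg.y;
      avatar.vx = msg.vx;               // last reported velocity (units/ms)
      avatar.vy = msg.vy;
      avatar.lastUpdate = Date.now();
    }

    function predictedPosition(avatar) {
      var dt = Date.now() - avatar.lastUpdate;   // ms since last update
      return { x: avatar.x + avatar.vx * dt,
               y: avatar.y + avatar.vy * dt };
    }

    // In the render loop: drawAvatar(predictedPosition(remoteAvatar));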
Stagas, you are correct, and thank you for mentioning this. Node is not to blame here, as you say. However, MMOs require orchestration and server parallelization expertise to scale to an acceptable user experience.
I have no doubt your system is a very powerful and scalable solution for pubsub, but because it's not using WebSockets, I believe the overhead is not acceptable for MMOs or other latency-sensitive tasks. So you shouldn't advertise it as such; people might actually believe you and try to build one. It'll blow up in their faces when they realize they can't get the latency down to an acceptable level, given the 200-300 extra bytes per request and the delay of constantly opening new connections.
There are a fair number of browser-based strategy games, such as evony.com, where latency is not really an issue. For those, the real scaling issue is mostly handling communication between back-end servers and 'world' chat.
It should be more than adequate for strategy or turn-based games. Perhaps I should have been clearer that I was talking about latency-sensitive MMOs, like shooters or anything where your position and movement in time are critical to the gameplay.
It's a good hack and all, but you didn't actually solve the real problems, most of which should be solved with server-side logic[0], something I'm guessing your service can't do?
I guess it would work more as a concept than in practice. That is, you could send to a specific peer through the server, so that only the messaging logic lives on the server and the game logic lives on the clients.
The clients would then (again, theoretically) be able to agree on a game state, optimize messaging, and do the other things you would normally do on the server.
Neat, and your service looks interesting, but I wouldn't call this 'fixed' yet. It did not seem to scale well.
Everyone is spawning at the same point. I just see a flood of dead users. Some jerk around a little and appear and disappear, but there is very little interaction.
Is it possible to have multiple spawn points when demand is high, and separate channels/servers for various areas that you switch between as you move? (like Second Life)
I believe a node.js (or something else) version could be federated, and/or clients could connect to various servers as they travel.
I know this is just a toy, but it would be interesting to see this work well at a large scale.
Hi JTxt. Good points regarding the fix. In fact, the system is scaling great and you can see a lot of people on the screen. The issue is drawing that number of players on the client computer. We improved the source code just a moment ago, and now disconnected users are shown in a stopped state while live users are actively moving around. It was wonderful to see this earlier this morning with about 2K active live users all in the same world. Updates have been posted. Enjoy.
Actually, I've done some analysis of the stream of messages, and it appears the current problem is that the vast majority of messages never arrive (even as of 7:30am CST). You can verify this easily by connecting with two browser windows. The first time I tried, it took almost 30 seconds before my second tab even saw my other character, even though I was moving both around. And the name in the second tab was never updated, even after a minute.
The second time I tried, my first player window saw the other player fairly quickly, but it never registered the change to a ghost or my name change; it just showed the second player as a balloon guy floating up forever (even after the second client window disconnected).
I did this testing at 7:30am CST. There were about 90 other players I was getting updates from. However, this is the same type of behavior I've seen whenever I've connected since it went live, and friends in other locations see the same behavior, so it's not just my environment.
I think your service is having some trouble. The updates appear fairly smooth because the client continues rendering the last vector seen for each avatar. However, comparing notes with several friends also connected, it's clear that very few of the messages are getting through, and sometimes only in one direction.
Also, the dynamic poll/sample interval you implemented seems to hit 1000ms (1 second) and stay there.
Hi Kanaka! Good question regarding the sample rate. We auto-scale the sample rate based on the number of occupants. It is currently at its 1000ms peak because there are so many of you! If there were only a few, it would go down to 50ms.
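Roughly the idea is something like the following; the constants and the linear ramp are my own guesses for illustration, not the actual pubnub-xkcd values:

    // Hedged sketch: scale the position-publish interval with occupancy.
    // MIN_MS, MAX_MS, FULL_AT and the linear ramp are assumptions, not the
    // real pubnub-xkcd constants.
    var MIN_MS = 50;        // interval with only a few players
    var MAX_MS = 1000;      // cap when the world is crowded
    var FULL_AT = 200;      // occupancy at which we hit the cap (assumed)

    var currentOccupancy = 0;   // updated elsewhere, e.g. from presence events

    function sampleInterval(occupancy) {
      var t = Math.min(occupancy / FULL_AT, 1);            // 0..1
      return Math.round(MIN_MS + t * (MAX_MS - MIN_MS));
    }

    function publishMyPosition() {
      // app-specific: publish my x/y to the channel
    }

    function publishLoop() {
      publishMyPosition();
      setTimeout(publishLoop, sampleInterval(currentOccupancy));
    }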
Yeah, we actually made the decision to cap the number of users so that the connected users would get a good experience with low latency and high interactivity. We figured that people who really wanted to see it would come back (or watch the video).
Oh, and also because I'm personally paying for the bandwidth, and my wife would be unhappy with a big bill at the end of the month for other people's gaming :-)
Hi JTxt! Yes, this is possible! Just update the channel value in this line from "/xkcd" to "/whatever_you_want" and you can spin up new parallel worlds. The source code you need to change is here: https://github.com/pubnub/pubnub-xkcd/blob/master/network.js...
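In PubNub-3.x-style client code the change is just a different channel string for subscribe and publish. A minimal sketch (the keys are the public demo keys, and parameter names may vary slightly between SDK versions):

    // Minimal sketch: pointing the client at a different "world" is just a
    // different channel string. PubNub-3.x-style API; parameter names may
    // differ slightly between SDK versions.
    var CHANNEL = '/whatever_you_want';   // was "/xkcd"

    var pubnub = PUBNUB.init({ publish_key: 'demo', subscribe_key: 'demo' });

    function updateWorld(msg) {
      // app-specific: apply msg to the local world state
    }

    pubnub.subscribe({
      channel: CHANNEL,
      callback: function (msg) { updateWorld(msg); }
    });

    pubnub.publish({
      channel: CHANNEL,
      message: { x: 10, y: 20, name: 'me' }   // illustrative payload
    });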
I'm talking about dividing the map into multiple channels so that users only receive updates from those around them, and hopefully making it more responsive.
So for 1000x1000 areas, if I'm at "x":3033521.3,"y":-3025356.4, my client subscribes to /xkcd_x3033_y-3025
And adjoining areas/channels if they're close to a border (but only sending updates to the current area/channel).
...IF it's easy enough for the client to dynamically change channels, and be subscribed to multiple (up to 4) channels.
Then combine this with multiple spawn points at other interesting locations when demand is high. But people can still explore and meet others.
So it should be more responsive, with less data sent, than everyone getting updates from everyone.
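A rough sketch of that channel math, following the /xkcd_x3033_y-3025 naming above; the border threshold and the use of Math.floor are my own assumptions:

    // Rough sketch: derive channel names from 1000x1000 grid cells and
    // subscribe to the current cell plus any adjacent cell you're near.
    // BORDER and Math.floor are assumptions (the example name above
    // truncates toward zero instead; either works if used consistently).
    var CELL = 1000;
    var BORDER = 100;   // how close to an edge before also joining the neighbor

    function cellChannel(cx, cy) {
      return 'xkcd_x' + cx + '_y' + cy;
    }

    function channelsFor(x, y) {
      var cx = Math.floor(x / CELL), cy = Math.floor(y / CELL);
      var channels = [cellChannel(cx, cy)];
      var dx = x - cx * CELL, dy = y - cy * CELL;      // offset within the cell
      if (dx < BORDER)        channels.push(cellChannel(cx - 1, cy));
      if (dx > CELL - BORDER) channels.push(cellChannel(cx + 1, cy));
      if (dy < BORDER)        channels.push(cellChannel(cx, cy - 1));
      if (dy > CELL - BORDER) channels.push(cellChannel(cx, cy + 1));
      return channels;   // up to 3 here; add the diagonal for corners if wanted
    }

    // On movement: unsubscribe from channels that drop out of this set,
    // subscribe to new ones, and publish only to cellChannel(cx, cy).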
Hi JTxt, this is a fantastic idea for saving resources. Streaming all the data to the clients, while possible with PubNub, slows the clients down because they are busy downloading the streams of every player's movements on the screen. Would you be able to work with us and coordinate a way to do this? Here is the starting line to segregate channel data - https://github.com/pubnub/pubnub-xkcd/blob/master/network.js... - see the "xkcd2" at the end. This is the CHANNEL ID. You can take that and change it based on the region of the world. You can even use the names of the world slices, such as http://www.pubnub.com/static/pubnub-xkcd/images/1n1e.png - 1n1e, "1 north 1 east", so the channel name would be "1n1e".
It definitely makes sense to split the world up into multiple channels. That's what the guys who built http://wordsquared.com/ did. As you scroll you unsubscribe from channels that are out of view and subscribe to channels that are moving into view. You could think of it in a similar way to how Google Maps tiling works.
For our original server, after the first full day after the HN post, we racked up an AWS bill of about $1.62. Most of that was in the first few minutes after the announcement before we implemented the user cap.
From earlier today, for about 20 hours of run time: we had 13,000 successful connections to the server and 22,000 failed attempts to connect. The server averages about 100k per second outbound when fully loaded with 20 clients. With the quick improvements we made after the HN flood we got the CPU down to about 1-5% on a t1.micro with 20 clients connected and interacting. The real issue for us is the bandwidth.
The server and protocol could be MUCH more efficient (visibility pruning, binary messages, etc). But it was something chouser and I hacked up over the weekend and it's now working quite well within the cap.
Your stats are in line with what I expected. Thanks.
What I was actually interested in was PubNub's stats, to see how such a service can help with apps like yours and to help figure out how much it'll cost.
We fixed this using a reliable transport that bundles, compresses and delivers the data more efficiently. Check it out! https://github.com/pubnub/pubnub-api/tree/master/websocket - PubNub WebSocket Emulation
PubNub offers full RFC 6455 support for the WebSocket client specification. PubNub WebSockets enable any browser (modern or not) to support the HTML5 WebSocket standard APIs. You can use the WebSocket client directly in your browser: new WebSocket works anywhere!
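For reference, the client-side usage is the standard W3C WebSocket API whether the socket is native or emulated. The URL scheme below (keys and channel in the path against pubsub.pubnub.com) is my assumption, so check the repo above for the exact format:

    // Standard W3C WebSocket client usage; with the emulation script loaded
    // this should also work on browsers without native support. The URL
    // format is an assumption -- see the linked repo for the real scheme.
    var ws = new WebSocket('wss://pubsub.pubnub.com/demo/demo/my_channel');

    ws.onopen    = function ()    { ws.send(JSON.stringify({ hello: 'xkcd' })); };
    ws.onmessage = function (evt) { console.log('received:', evt.data); };
    ws.onerror   = function (err) { console.log('socket error', err); };
    ws.onclose   = function ()    { console.log('socket closed'); };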
While emulating WebSockets on non-supporting browsers is definitely awesome, the point I am making is that I am using a browser that supports WebSockets, yet it's just doing a series of non-stop HTTP requests.
Shouldn't the emulation only occur when necessary, as a graceful degradation, rather than forcing all clients to downgrade to a less efficient transport?
From what I understand, PubNub provides servers around the world who take care of the heavy lifting and all you have to do is deal with publishing / subscribing messages.
So there are servers, you just don't run any yourself and it is all abstracted away by their API.
Excellent investigation! You are correct; thank you for the explanation here on this thread. Check it out: we also needed to remove the server.js file. No more Node needed now. Enjoy!
Nice work. It's an interesting problem but I don't think it's solved yet.
In the video and from my experience, I see a bunch of dead users spawning from a single point, making a 'pillar of death', and a few moving users that skip around, but none in any kind of fluid motion.
Perhaps have multiple spawns, offset by a small random amount.
Not sure how your system works, but perhaps put areas in separate channels so that users only subscribe to updates from their own area?
I'd rather see a few users in my area that can interact with me quickly than a flood of non-responsive ones.
JTxt, we had a similar problem (but for different reasons) with the original (1110.n01se.net) when the initial rush from HN happened. After a few minutes we moved the server to an AWS t1.micro instance (our previous hosting company didn't like CPU-intensive processes and was killing the server).
However, one of the problems with a t1.micro server on AWS is that the hypervisor will detect heavy load and start throttling the VM with a heavy-handed approach that basically just pauses the VM for several seconds. This would cause buffers to build up, and when the VM started running again you would get a burst of traffic.
The symptom was that other avatars would stop moving for a few seconds, then suddenly jump around, and then be back to smooth. All clients still received all the updates (WebSockets is reliable, over TCP), but they came in bursts whenever the VM was throttled. A few minutes after moving to the AWS server we realized what was happening (we use AWS for development and recognized the behavior), so we implemented a few easy fixes to the server (such as sending change deltas and serializing the data to JSON once instead of per client) to bring the CPU usage down to a reasonable level so that it wouldn't get throttled. Combined with the 20-connection cap, that resulted in a very smooth and low-latency experience for the players who were able to connect.
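The "JSON once" fix is roughly this kind of change; a sketch of a Node.js ws-style broadcast loop (the tick interval and the world structure are illustrative, not the actual server.js code):

    // Illustrative sketch (not the actual server.js): stringify the world
    // state once per tick and send the same string to every client, instead
    // of calling JSON.stringify once per client.
    var WebSocketServer = require('ws').Server;
    var wss = new WebSocketServer({ port: 8080 });

    var world = {};   // playerId -> last known state, filled in elsewhere

    setInterval(function () {
      var payload = JSON.stringify(world);       // serialize ONCE per tick
      wss.clients.forEach(function (client) {
        if (client.readyState === 1) {           // 1 === OPEN
          client.send(payload);
        }
      });
    }, 100);   // tick interval is an assumption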
One thing we are considering for the original is what you suggested: players only get updates for things that are visible to them. However, this means more processing on the server, because each client gets a different data set that has to be generated. So it's a tradeoff between decreasing bandwidth and increasing CPU (surprisingly often the case in the real world). If the CPU increase causes throttling, then we would start encountering burstiness again. So it might be worth it or it might not.
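To make that tradeoff concrete, here is a sketch of per-client visibility pruning; note the per-client filter and per-client JSON.stringify, which is exactly the extra CPU described (the view radius and field names are assumptions):

    // Illustrative visibility-pruning sketch: each client only receives
    // avatars within VIEW_RADIUS of its own position. The per-client filter
    // plus per-client JSON.stringify is the extra CPU cost discussed above.
    var VIEW_RADIUS = 2000;   // assumed view distance in world units

    function visibleTo(viewer, world) {
      var out = {};
      Object.keys(world).forEach(function (id) {
        var p = world[id];
        var dx = p.x - viewer.x, dy = p.y - viewer.y;
        if (dx * dx + dy * dy <= VIEW_RADIUS * VIEW_RADIUS) out[id] = p;
      });
      return out;
    }

    // Per tick: build and serialize a separate view for each client.
    // wss.clients.forEach(function (client) {
    //   client.send(JSON.stringify(visibleTo(client.player, world)));
    // });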
Anyways, if we have some free time we may play with some ideas. One of the first things is being able to simulate load manually. Doing a mad dash to try and implement improvements while thousands of HN users are trying to connect (we didn't know that a friend had posted it there) is certainly an adrenaline rush but not the most ideal way to do development :-)
Another solution would be to use a larger AWS instance. Right now on the t1.micro with 20 users the CPU stays around 1-5% (which is a safe zone for throttling). Larger instances don't have the throttling problem, so in addition to more horsepower we would be able to use a much higher percentage of it. However, it's just a spare-time project for us, and I'm still trying to figure out how to explain to the wife why we are paying (mostly for bandwidth) for other people to play an online game. :-)
And surprisingly, AWS doesn't seem to have a way to accept donations or gift cards toward AWS costs. Seems like a logical way to encourage free and open source development projects like this.
Thanks for getting this going and sharing your experience.
It's really fun when it's interactive.
I'd like to hope that having thousands in the same map with fast updates, without it being a huge burden on any one person, is possible.
Perhaps run a master instance at 1110.n01se.net
Invite others to run a slave instance that is configured to connect to your master, report health, and serve areas as directed.
The master serves the client and, when load is high, assigns the client a spawn point at other interesting locations.
It also manages the list of slaves and the areas they cover, and updates clients as that changes.
(But there's a potential bottleneck here. Not sure how to do this yet...
Also, how do you subdivide the map so high-traffic areas are smaller? I'm guessing quadtrees, so the subdivide data can be very small; a rough sketch follows these questions.
How well can clients be connected to multiple servers?
And is a master/slave relationship needed?)
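On the quadtree question above: the idea would be to split a region into four children whenever its player count crosses a threshold, so crowded areas end up with smaller channels. A rough sketch, with the threshold, the minimum cell size and the channel naming purely my own assumptions:

    // Rough quadtree sketch for adaptive region channels: split a region
    // into four children once it holds too many players, so busy areas map
    // to smaller channels. MAX_PLAYERS, the 1000-unit minimum size and the
    // naming scheme are all assumptions.
    var MAX_PLAYERS = 50;

    function Region(x, y, size, name) {
      this.x = x; this.y = y; this.size = size;   // top-left corner + side length
      this.name = name;                           // doubles as the channel id
      this.players = 0;
      this.children = null;
    }

    Region.prototype.childIndex = function (px, py) {
      var half = this.size / 2;                   // 0:NW 1:NE 2:SW 3:SE
      return (py >= this.y + half ? 2 : 0) + (px >= this.x + half ? 1 : 0);
    };

    Region.prototype.channelFor = function (px, py) {
      if (!this.children) return this.name;       // leaf: use this channel
      return this.children[this.childIndex(px, py)].channelFor(px, py);
    };

    Region.prototype.addPlayer = function (px, py) {
      this.players++;
      if (!this.children && this.players > MAX_PLAYERS && this.size > 1000) {
        var h = this.size / 2, x = this.x, y = this.y, n = this.name;
        this.children = [                         // NW, NE, SW, SE
          new Region(x,     y,     h, n + '-nw'),
          new Region(x + h, y,     h, n + '-ne'),
          new Region(x,     y + h, h, n + '-sw'),
          new Region(x + h, y + h, h, n + '-se')
        ];
        // Re-bucketing players already counted here is omitted in this sketch.
      }
      if (this.children) this.children[this.childIndex(px, py)].addPlayer(px, py);
    };

    // e.g. var root = new Region(-4096000, -4096000, 8192000, 'xkcd');
    //      root.addPlayer(x, y);  var channel = root.channelFor(x, y);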
Well I'll leave it there. It's an interesting problem and I think it's part of what node.js is trying to become.
Enter the xkcd World with Friends: http://www.pubnub.com/static/pubnub-xkcd/index.html