Hello all hackers from HackerNews. We noticed a new MMO released by n01se and xkcd yesterday (September 26th, 2012), with multiple users flying around as balloon figures. If you get stuck, you can click your balloon guy and turn into a ghost to move seamlessly through the landscape, unhindered by mortal barriers like trees and hills. There was a problem, however, with scaling users on the system: max concurrency was only 20 users at a time, leaving many to wonder where the MMO part of the MMO was. We ripped out the non-scaling Node.JS code. ENJOY.
I don't think it's fair to imply that Node.js is to blame for the scaling issues of this particular project. Dead reckoning, variable polling and other tricks could have been used with a Node.js server as well to make it look smooth and play decently; they just weren't there in this implementation.
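For instance, client-side dead reckoning can be as simple as extrapolating from the last position and velocity each peer reported. Here's a rough sketch; the avatar/message field names are made up for illustration, not taken from this project's code:

    // Rough dead-reckoning sketch: keep rendering a remote avatar between
    // sparse network updates by extrapolating its last reported vector.
    // Field names (x, y, vx, vy, lastUpdate) are illustrative only.
    function onNetworkUpdate(avatar, msg) {
      avatar.x = msg.x;                 // last authoritative position
      avatar.y = msg.y;
      avatar.vx = msg.vx;               // last reported velocity (units/ms)
      avatar.vy = msg.vy;
      avatar.lastUpdate = Date.now();
    }

    function predictedPosition(avatar) {
      var dt = Date.now() - avatar.lastUpdate;   // ms since last update
      return { x: avatar.x + avatar.vx * dt,
               y: avatar.y + avatar.vy * dt };
    }

    // In the render loop: drawAvatar(predictedPosition(remoteAvatar));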
Stagas, you are correct, and thank you for mentioning this. Node is not to blame here, as you say. However, MMOs require orchestration and server parallelization expertise to scale to an acceptable user experience.
I have no doubt your system is a very powerful and scalable solution for pubsub, but because it's not using WebSockets, I believe the overhead is not acceptable for MMOs or other latency-sensitive tasks. So you shouldn't advertise it as such; people might actually believe you and try to build one. It'll blow up in their faces when they realize they can't get the latency down to an acceptable level, given the 200-300 extra bytes per request and the delay of constantly opening new connections.
There are a fair number of browser-based strategy games, such as evony.com, where latency is not really an issue. For those, the real scaling issue is mostly handling communication between back-end servers and 'world' chat.
It should be more than adequate for strategy or turn-based games. Perhaps I should have been clearer that I was talking about latency-sensitive MMOs, like shooters or anything where your position and movement in time are critical to the gameplay.
It's a good hack and all, but you didn't actually solve the real problems, most of which should be solved with server-side logic[0], something I'm guessing your service can't do?
I guess it would work more as a concept than in practice. That is, you could send to a specific peer through the server, so that only the messaging logic lives on the server and the game logic lives on the clients.
The clients would then (again, theoretically) be able to agree on a game state, optimize messaging, and do the other things you would normally do on the server.
Neat, and your service looks interesting, but I wouldn't call this 'fixed' yet. It did not seem to scale well.
Everyone is spawning at the same point. I just see a flood of dead users. Some jerk around a little and appear and disappear, but there is very little interaction.
Is it possible to have multiple spawn points when demand is high, and separate channels/servers for various areas that you switch between as you move? (like Second Life)
I believe a node.js (or something else) version could be federated, and/or clients could connect to various servers as they travel.
I know this is just a toy, but it would be interesting to see this work well at a large scale.
Hi JTxt. Good points regarding the fix. In fact, the system is scaling great and you can see a lot of people on the screen. The issue is drawing that number of players on the client computer. We improved the source code just a moment ago, and now disconnected users are shown in a stopped state while live users are actively moving around. It was wonderful to see this earlier this morning with about 2K active live users all in the same world. Updates have been posted. Enjoy.
Actually, I've done some analysis of the stream of messages, and it appears the current problem is that the vast majority of messages never arrive (even as of 7:30am CST). You can verify this easily by connecting with two browser windows. The first time I tried, it took almost 30 seconds before my second tab even saw my other character, even though I was moving both around. And the name in the second tab was never updated, even after a minute.
The second time I tried, my first player window saw the other player fairly quickly, but it never registered the change to a ghost or my name change; it just showed the second player as a balloon guy floating up forever (even after the second client window disconnected).
I did this testing at 7:30am CST. There were about 90 other players I was getting updates from. However, this is the same type of behavior I've seen whenever I've connected since it went live, and friends in other locations see the same behavior, so it's not just my environment.
I think your service is having some trouble. The updates appear fairly smooth because the client continues rendering the last vector seen for each avatar. However, comparing notes with several friends also connected, it's clear that very few of the messages are getting through, and sometimes only in one direction.
Also, the dynamic poll/sample interval you implemented seems to hit 1000ms (1 second) and stay there.
Hi Kanaka! Good question regarding the sample rate. We auto-scale the sample rate based on the number of occupants. It is currently at its 1000ms peak because there are so many of you! If there were only a few, it would go down to 50ms.
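Roughly the idea is something like the following; the constants and the linear ramp are my own guesses for illustration, not the actual pubnub-xkcd values:

    // Hedged sketch: scale the position-publish interval with occupancy.
    // MIN_MS, MAX_MS, FULL_AT and the linear ramp are assumptions, not the
    // real pubnub-xkcd constants.
    var MIN_MS = 50;        // interval with only a few players
    var MAX_MS = 1000;      // cap when the world is crowded
    var FULL_AT = 200;      // occupancy at which we hit the cap (assumed)

    var currentOccupancy = 0;   // updated elsewhere, e.g. from presence events

    function sampleInterval(occupancy) {
      var t = Math.min(occupancy / FULL_AT, 1);            // 0..1
      return Math.round(MIN_MS + t * (MAX_MS - MIN_MS));
    }

    function publishMyPosition() {
      // app-specific: publish my x/y to the channel
    }

    function publishLoop() {
      publishMyPosition();
      setTimeout(publishLoop, sampleInterval(currentOccupancy));
    }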
Yeah, we actually made the decision to cap the number of users so that the connected users would get a good experience with low latency and high interactivity. We figured that people who really wanted to see it would come back (or watch the video).
Oh, and also because I'm personally paying for the bandwidth, and my wife would be unhappy with a big bill at the end of the month for other people's gaming :-)
Hi JTxt! Yes, this is possible! Just update the channel value in this line from "/xkcd" to "/whatever_you_want" and you can spin up new parallel worlds. The source code you need to change is here: https://github.com/pubnub/pubnub-xkcd/blob/master/network.js...
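In PubNub-3.x-style client code the change is just a different channel string for subscribe and publish. A minimal sketch (the keys are the public demo keys, and parameter names may vary slightly between SDK versions):

    // Minimal sketch: pointing the client at a different "world" is just a
    // different channel string. PubNub-3.x-style API; parameter names may
    // differ slightly between SDK versions.
    var CHANNEL = '/whatever_you_want';   // was "/xkcd"

    var pubnub = PUBNUB.init({ publish_key: 'demo', subscribe_key: 'demo' });

    function updateWorld(msg) {
      // app-specific: apply msg to the local world state
    }

    pubnub.subscribe({
      channel: CHANNEL,
      callback: function (msg) { updateWorld(msg); }
    });

    pubnub.publish({
      channel: CHANNEL,
      message: { x: 10, y: 20, name: 'me' }   // illustrative payload
    });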
I'm talking about dividing the map into multiple channels so that users only receive updates from those around them, and hopefully making it more responsive.
So for 1000x1000 areas, if I'm at "x":3033521.3,"y":-3025356.4, my client subscribes to /xkcd_x3033_y-3025
And adjoining areas/channels if they're close to a border (but only sending updates to the current area/channel).
...IF it's easy enough for the client to dynamically change channels, and be subscribed to multiple (up to 4) channels.
Then combine this with multiple spawn points at other interesting locations when demand is high. But people can still explore and meet others.
So it should be more responsive, with less data sent, than everyone getting updates from everyone.
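A rough sketch of that channel math, following the /xkcd_x3033_y-3025 naming above; the border threshold and the use of Math.floor are my own assumptions:

    // Rough sketch: derive channel names from 1000x1000 grid cells and
    // subscribe to the current cell plus any adjacent cell you're near.
    // BORDER and Math.floor are assumptions (the example name above
    // truncates toward zero instead; either works if used consistently).
    var CELL = 1000;
    var BORDER = 100;   // how close to an edge before also joining the neighbor

    function cellChannel(cx, cy) {
      return 'xkcd_x' + cx + '_y' + cy;
    }

    function channelsFor(x, y) {
      var cx = Math.floor(x / CELL), cy = Math.floor(y / CELL);
      var channels = [cellChannel(cx, cy)];
      var dx = x - cx * CELL, dy = y - cy * CELL;      // offset within the cell
      if (dx < BORDER)        channels.push(cellChannel(cx - 1, cy));
      if (dx > CELL - BORDER) channels.push(cellChannel(cx + 1, cy));
      if (dy < BORDER)        channels.push(cellChannel(cx, cy - 1));
      if (dy > CELL - BORDER) channels.push(cellChannel(cx, cy + 1));
      return channels;   // up to 3 here; add the diagonal for corners if wanted
    }

    // On movement: unsubscribe from channels that drop out of this set,
    // subscribe to new ones, and publish only to cellChannel(cx, cy).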
Hi JTxt, this is a fantastic idea for saving resources. Streaming all the data to the clients, while possible with PubNub, slows the clients down because they are busy downloading the streams of every player's movements on the screen. Would you be able to work with us and coordinate a way to do this? Here is the starting line to segregate channel data - https://github.com/pubnub/pubnub-xkcd/blob/master/network.js... - see the "xkcd2" at the end. This is the CHANNEL ID. You can take that and change it based on the region of the world. You can even use the names of the world slices, such as http://www.pubnub.com/static/pubnub-xkcd/images/1n1e.png - 1n1e, "1 north 1 east", so the channel name would be "1n1e".
It definitely makes sense to split the world up into multiple channels. That's what the guys who built http://wordsquared.com/ did. As you scroll you unsubscribe from channels that are out of view and subscribe to channels that are moving into view. You could think of it in a similar way to how Google Maps tiling works.
For our original server, after the first full day after the HN post, we racked up an AWS bill of about $1.62. Most of that was in the first few minutes after the announcement before we implemented the user cap.
From earlier today, for about 20 hours of run time: we had 13,000 successful connections to the server and 22,000 failed attempts to connect. The server averages about 100k per second outbound when fully loaded with 20 clients. With the quick improvements we made after the HN flood we got the CPU down to about 1-5% on a t1.micro with 20 clients connected and interacting. The real issue for us is the bandwidth.
The server and protocol could be MUCH more efficient (visibility pruning, binary messages, etc). But it was something chouser and I hacked up over the weekend and it's now working quite well within the cap.
Your stats are in line with what I expected. Thanks.
What I was actually interested in was PubNub's stats, to see how such a service can help with apps like yours and to help figure out how much it'll cost.
We fixed this using a reliable transport that bundles, compresses and delivers the data more efficiently. Check it out! https://github.com/pubnub/pubnub-api/tree/master/websocket - PubNub WebSocket Emulation
PubNub offers full RFC 6455 support for the WebSocket client specification. PubNub WebSockets enable any browser (modern or not) to support the HTML5 WebSocket standard APIs. You can use the WebSocket client directly in your browser: new WebSocket works anywhere!
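For reference, the client-side usage is the standard W3C WebSocket API whether the socket is native or emulated. The URL scheme below (keys and channel in the path against pubsub.pubnub.com) is my assumption, so check the repo above for the exact format:

    // Standard W3C WebSocket client usage; with the emulation script loaded
    // this should also work on browsers without native support. The URL
    // format is an assumption -- see the linked repo for the real scheme.
    var ws = new WebSocket('wss://pubsub.pubnub.com/demo/demo/my_channel');

    ws.onopen    = function ()    { ws.send(JSON.stringify({ hello: 'xkcd' })); };
    ws.onmessage = function (evt) { console.log('received:', evt.data); };
    ws.onerror   = function (err) { console.log('socket error', err); };
    ws.onclose   = function ()    { console.log('socket closed'); };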
While emulating WebSockets on non-supporting browsers is definitely awesome, the point I am making is that I am using a browser that supports WebSockets, yet it's just doing a series of non-stop HTTP requests.
Shouldn't the emulation only occur when necessary, as a graceful degradation, rather than forcing all clients to downgrade to a less efficient transport?
From what I understand, PubNub provides servers around the world who take care of the heavy lifting and all you have to do is deal with publishing / subscribing messages.
So there are servers, you just don't run any yourself and it is all abstracted away by their API.
Excellent investigation! You are correct; thank you for the explanation here on this thread. Check it out: we also needed to remove the server.js file. No more Node needed now. Enjoy!
Nice work. It's an interesting problem but I don't think it's solved yet.
In the video and from my experience, I see a bunch of dead users spawning from a single point, making a 'pillar of death', and a few moving users that skip around, but none in any kind of fluid motion.
Perhaps have multiple spawns, offset by a small random amount.
Not sure how your system works, but perhaps put areas in separate channels so that users only subscribe to updates from their own area?
I'd rather see a few users in my area that can interact with me quickly than a flood of non-responsive ones.
JTxt, we had a similar problem (but for different reasons) with the original (1110.n01se.net) when the initial rush from HN happened. After a few minutes we moved the server to an AWS t1.micro instance (our previous hosting company didn't like CPU-intensive processes and was killing the server).
However, one of the problems with a t1.micro server on AWS is that the hypervisor will detect heavy load and start throttling the VM with a heavy-handed approach that basically just pauses the VM for several seconds. This would cause buffers to build up, and when the VM started running again you would get a burst of traffic.
The symptom was that other avatars would stop moving for a few seconds, then suddenly jump around, and then be back to smooth. All clients still received all the updates (WebSockets is reliable, over TCP), but they came in bursts whenever the VM was throttled. A few minutes after moving to the AWS server we realized what was happening (we use AWS for development and recognized the behavior), so we implemented a few easy fixes to the server (such as sending change deltas and serializing the data to JSON once instead of per client) to bring the CPU usage down to a reasonable level so that it wouldn't get throttled. Combined with the 20-connection cap, that resulted in a very smooth and low-latency experience for the players who were able to connect.
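The "JSON once" fix is roughly this kind of change; a sketch of a Node.js ws-style broadcast loop (the tick interval and the world structure are illustrative, not the actual server.js code):

    // Illustrative sketch (not the actual server.js): stringify the world
    // state once per tick and send the same string to every client, instead
    // of calling JSON.stringify once per client.
    var WebSocketServer = require('ws').Server;
    var wss = new WebSocketServer({ port: 8080 });

    var world = {};   // playerId -> last known state, filled in elsewhere

    setInterval(function () {
      var payload = JSON.stringify(world);       // serialize ONCE per tick
      wss.clients.forEach(function (client) {
        if (client.readyState === 1) {           // 1 === OPEN
          client.send(payload);
        }
      });
    }, 100);   // tick interval is an assumption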
One thing we are considering for the original is what you suggested: players only get updates for things that are visible to them. However, this means more processing on the server, because each client gets a different data set that has to be generated. So it's a tradeoff between decreasing bandwidth and increasing CPU (surprisingly often the case in the real world). If the CPU increase causes throttling, then we would start encountering burstiness again. So it might be worth it or it might not.
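To make that tradeoff concrete, here is a sketch of per-client visibility pruning; note the per-client filter and per-client JSON.stringify, which is exactly the extra CPU described (the view radius and field names are assumptions):

    // Illustrative visibility-pruning sketch: each client only receives
    // avatars within VIEW_RADIUS of its own position. The per-client filter
    // plus per-client JSON.stringify is the extra CPU cost discussed above.
    var VIEW_RADIUS = 2000;   // assumed view distance in world units

    function visibleTo(viewer, world) {
      var out = {};
      Object.keys(world).forEach(function (id) {
        var p = world[id];
        var dx = p.x - viewer.x, dy = p.y - viewer.y;
        if (dx * dx + dy * dy <= VIEW_RADIUS * VIEW_RADIUS) out[id] = p;
      });
      return out;
    }

    // Per tick: build and serialize a separate view for each client.
    // wss.clients.forEach(function (client) {
    //   client.send(JSON.stringify(visibleTo(client.player, world)));
    // });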
Anyways, if we have some free time we may play with some ideas. One of the first things is being able to simulate load manually. Doing a mad dash to try and implement improvements while thousands of HN users are trying to connect (we didn't know that a friend had posted it there) is certainly an adrenaline rush but not the most ideal way to do development :-)
Another solution would be to use a larger AWS instance. Right now on the t1.micro with 20 users the CPU stays around 1-5% (which is a safe zone for throttling). Larger instances don't have the throttling problem, so in addition to more horsepower we would be able to use a much higher percentage of it. However, it's just a spare-time project for us, and I'm still trying to figure out how to explain to the wife why we are paying (mostly for bandwidth) for other people to play an online game. :-)
And surprisingly, AWS doesn't seem to have a way to accept donations or gift cards toward AWS costs. Seems like a logical way to encourage free and open source development projects like this.
Thanks for getting this going and sharing your experience.
It's really fun when it's interactive.
I'd like to hope that having thousands in the same map with fast updates, without it being a huge burden on any one person, is possible.
Perhaps run a master instance at 1110.n01se.net
Invite others to run a slave instance that is configured to connect to your master, report health, and serve areas as directed.
The master serves the client and, when load is high, assigns the client a spawn point at other interesting locations.
It also manages the list of slaves and the areas they cover, and updates clients as that changes.
(But there's a potential bottleneck here. Not sure how to do this yet...
Also, how do you subdivide the map so high-traffic areas are smaller? I'm guessing quadtrees, so the subdivide data can be very small; a rough sketch follows these questions.
How well can clients be connected to multiple servers?
And is a master/slave relationship needed?)
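On the quadtree question above: the idea would be to split a region into four children whenever its player count crosses a threshold, so crowded areas end up with smaller channels. A rough sketch, with the threshold, the minimum cell size and the channel naming purely my own assumptions:

    // Rough quadtree sketch for adaptive region channels: split a region
    // into four children once it holds too many players, so busy areas map
    // to smaller channels. MAX_PLAYERS, the 1000-unit minimum size and the
    // naming scheme are all assumptions.
    var MAX_PLAYERS = 50;

    function Region(x, y, size, name) {
      this.x = x; this.y = y; this.size = size;   // top-left corner + side length
      this.name = name;                           // doubles as the channel id
      this.players = 0;
      this.children = null;
    }

    Region.prototype.childIndex = function (px, py) {
      var half = this.size / 2;                   // 0:NW 1:NE 2:SW 3:SE
      return (py >= this.y + half ? 2 : 0) + (px >= this.x + half ? 1 : 0);
    };

    Region.prototype.channelFor = function (px, py) {
      if (!this.children) return this.name;       // leaf: use this channel
      return this.children[this.childIndex(px, py)].channelFor(px, py);
    };

    Region.prototype.addPlayer = function (px, py) {
      this.players++;
      if (!this.children && this.players > MAX_PLAYERS && this.size > 1000) {
        var h = this.size / 2, x = this.x, y = this.y, n = this.name;
        this.children = [                         // NW, NE, SW, SE
          new Region(x,     y,     h, n + '-nw'),
          new Region(x + h, y,     h, n + '-ne'),
          new Region(x,     y + h, h, n + '-sw'),
          new Region(x + h, y + h, h, n + '-se')
        ];
        // Re-bucketing players already counted here is omitted in this sketch.
      }
      if (this.children) this.children[this.childIndex(px, py)].addPlayer(px, py);
    };

    // e.g. var root = new Region(-4096000, -4096000, 8192000, 'xkcd');
    //      root.addPlayer(x, y);  var channel = root.channelFor(x, y);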
Well I'll leave it there. It's an interesting problem and I think it's part of what node.js is trying to become.
Enter the xkcd World with Friends: http://www.pubnub.com/static/pubnub-xkcd/index.html