The Road to 2M Websocket Connections in Phoenix (phoenixframework.org)
292 points by chrismccord on Nov 3, 2015 | 46 comments



Great job Phoenix team! I wonder how much of this wouldn't even be possible if not for the great BEAM platform and Cowboy web server.

Whatsapp famously works with 2m connections on a FreeBSD box (this number is old, I bet they've beaten that number). http://highscalability.com/blog/2014/2/26/the-whatsapp-archi...

I wonder whether these two cases are in any way comparable. Different stacks, different machines, different test. The only similarity being BEAM/Erlang used as a platform. Speaks well of its scalability!


None of it would be possible without BEAM and Cowboy :) Cowboy is a huge driver for our great numbers. It handles the tcp stack and websocket connections for us. As I said elsewhere in this thread, we are doing more than just holding idle WS connections open, but we're standing on the shoulders of giants as far as cowboy and erlang are concerned.


We've been using Phoenix for several applications for over a year. The channel functionality is so easy to use and implement. Highly recommend people check this out: http://www.phoenixframework.org/docs/channels


I'd also recommend the Programming Phoenix book; it's still in beta, but the chapter on Channels just came out:

https://pragprog.com/book/phoenix/programming-phoenix


My ElixirConfEU keynote gave a high-level overview of the channels design for those interested - quick jump to place in presentation: https://youtu.be/u21S_vq5CTw?t=22m34s

I'm also happy to answer any other questions. The team is very excited about our channel and PubSub layers' progress over the last year.


If you listen on multiple ports, you can get way more than 64k connections per client. If the workload is mostly idle, tsung should be able to manage quite a few connections too.


We'd like to explore a private network so we can get away with more than 64k per client, as it starts to get expensive spinning up 50 servers whenever we want to test. Unfortunately, the instances that were lent to us came with no way for us to assign additional IPs.


You don't need multiple IPs, tcp connections are a unique 4-tuple: {ClientIP, ServerIP, ClientPort, ServerPort}. If you add more ServerPorts, you can re-use the client ports. Now, if you want to test past 4 billion connections, you need more IPs. :)
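
For illustration, here's a minimal Elixir sketch of that trick (my own example, not from the thread; it assumes servers are already listening on ports 4000 and 4001 and that the OS permits reusing the local port via reuseaddr):

    # Both connections pin the same local IP and port; the kernel accepts this
    # because the remote endpoints differ, so each 4-tuple is still unique.
    opts = [:binary, active: false, ip: {127, 0, 0, 1}, port: 50_000, reuseaddr: true]

    {:ok, sock_a} = :gen_tcp.connect({127, 0, 0, 1}, 4000, opts)
    {:ok, sock_b} = :gen_tcp.connect({127, 0, 0, 1}, 4001, opts)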


> You don't need multiple IPs, tcp connections are a unique 4-tuple: {ClientIP, ServerIP, ClientPort, ServerPort}.

Sometimes someone throws something out so casually, and yet you realise the innate sense of what they just said. I literally stopped with my coffee halfway to my mouth.

I'd never have thought of doing it that way.


I may have connection tested a few things in the past ;) If you're benchmarking with tsung, it's nice, because it has hooks to let you pick the client port when making the connection; you'll probably need this, because the OS default assignment algorithms tend to get it wrong with this much stuff going on.


Wait wait wait, you can re-use client ports for multiple connections? Does this actually make it past NATs? Why does every TCP connection get a unique client port by default if there's no problem with reusing them as long as you're talking to a different {ServerIP,ServerPort} pair?


I'd be interested in benchmarks vs other similar frameworks, for the same use case. These had JSON/DB query benchmarks; however, Phoenix lags behind (will it be at the top soon?): https://www.techempower.com/benchmarks/previews/round11/


Take a look at the source code. The Phoenix entry in question is:

1) outdated; the version used is really, really old (0.8 or so)

2) configured with a really low database pool size, unlike the other participants (~10x lower)

3) configured to do many things other frameworks skip, like generating a CSRF token

All of these points have already been fixed but, most likely, the new results will go into Round 12.


OK, we'll see in Round 12.

But why did you downvote my comment? I was just reporting a fact: a benchmark performed by a third party. It's not like I was voicing a weird opinion or fabricating the truth.


I didn't (I have neither such rights nor the intention). Also, a valid question IMO. BTW, I tested it [the old version] locally with a single change: pool size set to 256 (similar to other frameworks). Result: ~10x performance boost
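
For reference, the knob in question is just the Ecto repo's pool_size; a hypothetical config sketch (app and repo names made up, values illustrative):

    # config/prod.exs -- the default pool_size is far smaller than this
    config :hello_phoenix, HelloPhoenix.Repo,
      url: System.get_env("DATABASE_URL"),
      pool_size: 256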


This is par for the course on the TechEmpower benchmarks. Certain technologies' benchmarks have always been relentlessly optimized, while others are left to "the community", which means you can submit PRs, but they'll likely be rejected even if they reflect the usual deployment patterns for a high proportion of users of the language/framework being represented.


that's why this benchmark seems irrelevant.


I've been using elixir/cowboy/Phoenix for the past few months and it's been great. It seems like the elixir ecosystem is still being built out but there are a lot of erlang libraries to use also.


I'd be interested in seeing this same performance benchmark run against various other OSes.

In particular DragonflyBSD since it has a lockless network stack.

And FreeBSD for good measure given that Whatsapp runs an Erlang/FreeBSD stack and has documented 2-3M connections themselves.


We were shocked at how well stock Ubuntu performed under these tests. With the stories of FreeBSD tuning, we expected to hit an OS ceiling, but other than a dozen shell script lines setting some sysctl limits, everything else was stock. I would also be very interested in how other OSes perform.


This is really cool, I'm hoping to see more small companies experimenting with Phoenix. It seems like a great tool.


Is it a relevant benchmark to open 2M connections without doing anything with them?


I was worried some might not make it that far into the post and come away with that thought, but such is the nature of long-form prose :). This is more than idle WS connections. Every channel client connects to the server (via WS), then subscribes via our PubSub layer to the chat room topic. Once we connect all 2M clients, we use the real chat app to broadcast a message to all 2M users. Broadcasts go out to all WS clients in about 2s. 2M people in the same chat room might not be a real-world fit, but it stressed our PubSub layer to its max, which was one of the goals for these tests. We were actually shocked at how well broadcasts hold up at these levels.
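
For a rough idea of the shape of the channel involved, here's a minimal sketch with hypothetical module names (not the actual benchmark code):

    defmodule Chat.RoomChannel do
      use Phoenix.Channel

      # Every one of the 2M clients joins the same topic; joining subscribes
      # the channel process to "rooms:lobby" via the PubSub layer.
      def join("rooms:lobby", _message, socket) do
        {:ok, socket}
      end

      # A single incoming message is fanned out to every subscriber of the topic.
      def handle_in("new_msg", %{"body" => body}, socket) do
        broadcast!(socket, "new_msg", %{body: body})
        {:noreply, socket}
      end
    end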


What this benchmark shows is how lightweight the per-websocket memory footprint is. 2M users is really impressive.

2M users per machine is great if they are mostly idle. This is the use case of WhatsApp, and their stats are[1]:

> Peaked at 2.8M connections per server

> 571k packets/sec

> >200k dist msgs/sec

Not every app is meant to have mostly idle users. The question is whether a real-time MMO FPS can be done in Phoenix (let's limit each user's neighbourhood to the 10 closest players).

I'd be very interested in the other corners of the envelope for Phoenix: a requests/second benchmark over websockets, with the associated latency HdrHistogram, like [2]

[1] http://highscalability.com/blog/2014/2/26/the-whatsapp-archi...

[2] http://www.ostinelli.net/a-comparison-between-misultin-mochi...


> Not every app is meant to have mostly idle users though. The question is whether a real-time MMO FPS can be done in Phoenix (let's limit each user's neighbourhood to the 10 closest players).

Yes, and the next phase of our tests will explore these kinds of use cases. To give you an idea, our PubSub layer can support 500k messages/sec on a MacBook. These tests were specifically around max clients and max subscribers, but more tests are needed for hard numbers around the use cases you laid out, which we'll be a great fit for. I think gaming will be a huge target for Phoenix.
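
For anyone wanting to poke at the PubSub layer directly, a minimal sketch (using the Phoenix.PubSub API as documented today, and a hypothetical MyApp.PubSub server name):

    # Subscribe the current process to a topic, then broadcast to every subscriber.
    :ok = Phoenix.PubSub.subscribe(MyApp.PubSub, "rooms:lobby")
    :ok = Phoenix.PubSub.broadcast(MyApp.PubSub, "rooms:lobby", {:new_msg, "hello"})

    receive do
      {:new_msg, body} -> IO.puts("got broadcast: #{body}")
    end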


I am a very happy user of Phoenix myself, and also an MMO game programmer, and unfortunately I can say with confidence that you won't implement a real-time MMO FPS on top of Phoenix channels any time soon.

Not because of Phoenix itself, but because all modern lag-compensating techniques rely on specific properties of UDP that are not implementable over TCP (and thus not over Websockets or HTTP long poll, which are the currently supported Phoenix Channel transports).

Not to diminish the value of Phoenix: the framework is really pleasant to use both on dev side and ops side. And you could use it today to implement the server-side of most games, even some "massive multiplayer" ones, as long as latency is not your primary concern.


You are right, our default transports (WS/LongPoll) are not well suited for FPS requirements, but just to be clear, transports are adapter-based and you can implement your own today for your own esoteric protocol or serialization format. Outlined here: http://hexdocs.pm/phoenix/Phoenix.Socket.Transport.html

The backend channel code remains the same and the transport takes care of the underlying communication details.
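
As a rough illustration of the adapter approach (module names are hypothetical, and the exact callback contract lives in the Phoenix.Socket.Transport docs linked above; this follows the Phoenix 1.x socket style):

    defmodule MyApp.UserSocket do
      use Phoenix.Socket

      # The built-in adapters...
      transport :websocket, Phoenix.Transports.WebSocket
      transport :longpoll, Phoenix.Transports.LongPoll

      # ...and a hypothetical custom adapter implementing Phoenix.Socket.Transport,
      # e.g. one speaking a bespoke binary protocol. The channel code is untouched.
      transport :custom, MyApp.CustomTransport

      channel "rooms:*", MyApp.RoomChannel

      def connect(_params, socket), do: {:ok, socket}
      def id(_socket), do: nil
    end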


2.8M connections in a real-world scenario? Wow. I was wondering what kind of technology they used to get that, and I guess I shouldn't be surprised to see that it's Erlang.


There is a video (https://www.youtube.com/watch?v=N4Duii6Yog0) at the end of the post which shows Chris demoing the chat room app.


The other day I set up hot code upgrades (on a single-node production instance) with exrm and Phoenix. It was a breeze and took about 30 minutes of reading and setup, and now I can upgrade the backend while all the websocket connections stay running. So cool!


Now that you can buy servers in tiny linear increments, it seems more interesting (but less headline-y) to talk about KB per connection or CPU per connection. It looks like this is about 41KB per connection.

For comparison, I checked my not-very-optimized Go server, and it looks like it uses (based on the VSZ value) 25KB per connection, with 43KB per connection used by nginx (which provides SSL termination).


Like I highlighted elsewhere, did your Go server set up a separate multiplexed process for the PubSub channel and track the PubSub subscribers on each client connection? Comparing the client size to raw WS connections isn't accurate in this context. Horizontal scaling is nice, but you also have to consider the overhead in broadcasts that are now distributed over dozens/hundreds of nodes vs a handful of large nodes.


I track the pubsub subscribers on each client connection, but I'm not sure what you mean by a multiplexed process for the pubsub channel.

Depending on your traffic patterns, considering the broadcast activity is potentially important, and then you might worry more about CPU than RAM.

I use redis myself, but gnatsd or redis cluster may help with scaling the pubsub part of the system.


Great article! One minor nit: When explaining the bag vs duplicate_bag thing, it says

> The difference between bag and duplicate_bag is that duplicate_bag will allow multiple entries for the same key.

bag also allows multiple entries per key. After reading the ETS documentation, it appears that duplicate_bag allows the same object instance to appear as a value for a key multiple times, whereas bag only allows an object instance to be added once to a key (e.g. so if you add the exact same object instance to the table using a given key, bag will only end up with one value for the key, but duplicate_bag will happily have many identical values for that key).
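
A quick iex illustration of that difference (table and key names arbitrary):

    bag = :ets.new(:bag_demo, [:bag])
    :ets.insert(bag, {:socket1, :pid_a})
    :ets.insert(bag, {:socket1, :pid_b})   # different object, same key: both kept
    :ets.insert(bag, {:socket1, :pid_a})   # identical object: stored only once
    :ets.lookup(bag, :socket1)
    #=> [{:socket1, :pid_a}, {:socket1, :pid_b}]

    dup = :ets.new(:dup_demo, [:duplicate_bag])
    :ets.insert(dup, {:socket1, :pid_a})
    :ets.insert(dup, {:socket1, :pid_a})   # identical object kept twice
    :ets.lookup(dup, :socket1)
    #=> [{:socket1, :pid_a}, {:socket1, :pid_a}]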

The following sentence is still fine:

> Since each socket can only connect once and have one pid, using a duplicate bag didn't cause any issues for us.


I love this kind of stuff. A while back I attempted to do similar benchmarking with Pushpin: http://blog.fanout.io/2013/10/30/pushing-to-100000-api-clien...

Like the Phoenix test, this tested broadcast throughput. However, we couldn't push more than 5,000 messages per second out of a single EC2 m1.xlarge instance without latency increasing. My theory is that we were hitting the maximum throughput of the underlying NIC, causing packet loss and requiring TCP to retransmit (hence the latency). At some point I want to try adding multiple virtual NICs to the same EC2 instance and see if that helps.


EC2 instances (at least in EC2 Classic) indeed have a limit of ~120K packets per second. I don't know if this is per NIC, so I would be interested in seeing your results.


Am I reading this correctly... 83GB of RAM in use for 2M connections? Seems like a lot of overhead.

I have achieved 500k concurrent connections using Atmosphere, consuming around 10GB of RAM.


Also keep in mind our clients aren't just idle WS processes. They start a channel process for each topic they join ("rooms:lobby" in this case). So one WebSocket connection will spawn two processes: one for the transport (WS handler), and one for the channel. The channel also holds its own state, which is mostly empty but carries metadata, so it's hard to compare to your 500k raw connections.


That's correct. We have not made any attempt at optimizing connection size yet, so there should be room to trim this down. But honestly, I'm thrilled with the numbers as is. This box on rackspace will run you $1500/mo, which is going to be peanuts for a company requiring this kind of scale, but the more we can trim down the connection size the better.


> which is going to be peanuts for a company requiring this kind of scale

That depends. Two million connections from people in an e-commerce site is a lot, but two million connections for a side-thingy like some analytics/ads/background-job-whatever doesn't have to be that much, especially if you consider 3rd party code.


Cases will vary, but if you have 2 million active users at a time and can't cover $1500/mo, then I'm not sure what your options will be. Along these lines, I'm really excited about what kinds of creations Phoenix will enable, exactly because of the kind of scale it gives you on current hardware. I think we'll see disruptive applications/services come out because of what you can get out of a single machine, so I find stressing the price as unaffordable in this context really bizarre.


I understand, I do like the idea of Phoenix, and it is definitely great to be able to keep 2M connections open at all.

I'm just wondering where the per-connection overhead is going. Is there some inherent limitation of the WebSocket protocol that forces the server to keep large buffers or something? Not trying to bash Phoenix, I'm just genuinely interested in what the lowest possible overhead is that one could achieve while keeping an open WS connection.


Doing some rough math, and with my limited understanding of Linux network internals, it's about 40KB per connection in this benchmark. I know that cowboy is going to require ~4KB or so per connection. Consulting a local Ubuntu install, the default minimum TCP buffer sizes will be at least 4KB each (x2, for read and write), but by default 16KB each, and by default the max goes to 1.5MB or so each. This is required for TCP retransmits and such. If you have clients on shoddy connections or see packet loss, your memory could skyrocket on you. I remember reading of a case where someone had a service die despite a memory overhead of 33% when the TCP packet loss rate went up (still under 1%), because it caused their buffer sizes to grow large enough to run out of memory.

So that's 8KB (will be higher with more usage) for TCP buffers in the kernel, 4KB or so for cowboy, and 28KB or so left for various other bits of the system when amortized per connection.
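
As a back-of-the-envelope check of those numbers (my own arithmetic, assuming the ~83GB figure is decimal gigabytes):

    total_bytes = 83 * 1_000_000_000
    connections = 2_000_000

    per_conn = div(total_bytes, connections)   # 41_500 bytes, i.e. roughly 40KB
    IO.puts("#{per_conn} bytes = #{Float.round(per_conn / 1024, 1)} KiB per connection")
    # of which ~8KB is kernel TCP buffers and ~4KB cowboy per the estimate above,
    # leaving ~28KB for channel processes and other per-connection state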


That's only about twice as much as in your case (I'm not saying it's not a lot, just pointing it out).

Still, I wonder how low one could get by implementing this at a much lower level. 41kB per connection seems like a lot of bookkeeping for something that's essentially a handle to a socket. Yes, there's process overhead in BEAM, but based on the Erlang docs, this should be only 309 memory words.
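
One way to sanity-check the BEAM side of that number (an illustrative snippet, not from the thread; on a 64-bit VM a word is 8 bytes, so a few hundred words is only a few KB):

    # Spawn an (almost) empty process and ask the VM how much memory it uses.
    pid = spawn(fn -> receive do _ -> :ok end end)
    {:memory, bytes} = Process.info(pid, :memory)
    IO.puts("bare process: #{bytes} bytes (#{div(bytes, 8)} words)")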


Isn't there a 65k limit on the number of open ports in the TCP stack?


As noted above, TCP connection limits are only on a unique 4-tuple of local-ip, local-port, remote-ip, remote-port.



