Woe be unto you for using a WebSocket (adama-lang.org)
266 points by mathgladiator on Dec 22, 2021 | 134 comments



I worked for a while on a well-known product that used (and perhaps still uses) WebSockets for its core feature. I very much agree with the bulk of the arguments made in this blog post.

In particular, I found this:

- Our well-known cloud hosting provider's networks would occasionally (a few times a year) disconnect all long-lived TCP sockets in an availability zone in unison. That is, an incident that had no SLA promise would cause a large swath of our customers to reconnect all at once.

- On a smaller scale, but more frequently: office networks of large customers would do the same thing.

- Some customers had network equipment that capped the length of time that a TCP connection could remain open, interfering with the intended operation

- And of course, unless you never want to upgrade your server software, you must at some point restart your servers (and again, your cloud hosting provider likely has no SLA on the uptime of an individual machine)

- As is pointed out in the article, a TCP connection can cease to transmit data even though it has not closed. So attention must be paid to this.

If you use WebSockets, you must make reconnects be completely free in the common case and you must employ people who are willing to become deeply knowledgeable in how TCP works.

WebSockets can be a tremendously powerful tool to help in making a great product, but in general they will almost always add more complexity and toil, with lower reliability.

(edited typos)


I built several large enterprise products over WebSockets. I didn't find it that bad.

Office networks that either blocked or killed WebSockets were annoying. For some customers they were a non-starter in the early 2010s, but by 2016 or so this seemed to be resolved.

Avoiding thundering herd on reconnect is a well-explored problem and wasn't too bad.

We would see mass TCP issues from time to time as well, but they were pretty much no-ops, as they would just trigger a timeout and a reconnect the next time the user performed an operation. We would send an ACK back instantly (prior to execution) for any client-requested operation, so if we didn't see the ACK within a fairly tight window, the client could proactively reap the WebSocket and try again - customers didn't have to wait long to learn a connection was dead but unclosed.

> If you use WebSockets, you must make reconnects be completely free in the common case

I agree with this, or at least "close to completely free." But in a normal web application you also need to make latency and failed requests "close to completely free" as well or your application will also die along with the network. This is the point I make in my sibling comment - I think distributed state management is a hard problem, but WebSockets are just a layer on top of that, not a solution or cause of the problem.

> you must employ people who are willing to become deeply knowledgeable in how TCP works.

I think this is true insofar as you probably want a TCP expert somewhere in your organization to start with, but we never found this particularly complicated. Understanding that the connection isn't trustworthy (that is, when it says it's open, that doesn't mean it works) is the only important fundamental for most engineers to be able to work with WebSockets.


> Avoiding thundering herd on reconnect is a very explored problem and wasn't too bad.

Can you please share approaches to mitigate this issue?


As rakoo said, exponential backoff mitigates the thundering herd. I was going to say add some jitter to the time before reconnecting, then I realized rakoo already said "after a random short time", which is exactly what jitter is. (edited for coffee kicking in)


Congestion avoidance algorithms such as TCP Reno and TCP Vegas: basically, code clients to back off if they detect a situation where they may be a member of a thundering herd.


Exponential back-off. Basically, try to reconnect after a random short time; if that doesn't work, try with a delay twice as long, then twice again, etc.


Usually you want the 2x wait to be a random time between 1.5x and 2x longer or something.
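
A minimal sketch of that reconnect strategy, exponential backoff plus jitter (the URL, base delay, and cap are illustrative, not prescriptive):

    // Reconnect loop with exponential backoff and jitter.
    function connectWithBackoff(url: string): void {
      let delayMs = 1000;           // base delay, illustrative
      const maxDelayMs = 60_000;    // cap so clients don't back off forever

      const attempt = (): void => {
        const ws = new WebSocket(url);

        ws.onopen = () => {
          delayMs = 1000;           // reset after a successful connection
        };

        ws.onclose = () => {
          // Jitter: wait a random time in [0.5x, 1x] of the current delay,
          // then double the delay for the next failure.
          const wait = delayMs * (0.5 + Math.random() * 0.5);
          delayMs = Math.min(delayMs * 2, maxDelayMs);
          setTimeout(attempt, wait);
        };
      };

      attempt();
    }

Randomizing the wait is what keeps a mass disconnect from turning into a synchronized mass reconnect.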


> Office networks that either blocked or killed WebSockets were annoying

Curious: how did they detect WS usage? Were you running on HTTP or did they just kill any long-lived TCP connection? Root certs?


No, we always ran on TLS. There were a few classes of these:

* Filtering MITM application firewall solutions which installed a new trusted root CA on employee machines and looked at the raw traffic. These would usually be configured to wholesale kill the connection when they saw an UPGRADE because the filtering solutions couldn't understand the traffic format and they were considered a security risk.

* Oldschool HTTP proxy based systems which would blow up when CONNECT was kept alive for very long.

* Firewalls which killed long-lived TCP connections just at the TCP level. The worst here were where there was a mismatch somewhere and we never got a FIN. But again, because we had a rapid expectation for an acknowledgement, we could detect and reap these pretty quickly.

We also tried running WebSockets on a different port for a while, which was not a good idea as many organizations only allowed 443.


> But again, because we had a rapid expectation for an acknowledgement, we could detect and reap these pretty quickly.

I found the best way to handle this was with an application level heartbeat. That bypassed dealing with any weirdness of the client firewalls, TCP spoofing, etc.


Something like pinging every 30 seconds and saying goodbye to the socket if we don't receive 2 replies seems to work reasonably well.

And it also prevents most idle-based TCP disconnects from happening.

And even if some network is so dumb that it decides to kill the connection in under 30s, it's a non-issue, as that network won't even be usable by normal means. (How do you download any big file if it always disconnects instantly?)
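
A minimal sketch of that application-level heartbeat, assuming a JSON message protocol where the server answers a "ping" with a "pong" (the message shape and intervals are assumptions):

    // Ping every 30s, drop the socket after two missed pongs.
    function startHeartbeat(ws: WebSocket): void {
      let missed = 0;

      const timer = setInterval(() => {
        if (missed >= 2) {
          clearInterval(timer);
          ws.close();              // reap the dead-but-unclosed connection
          return;
        }
        missed += 1;
        ws.send(JSON.stringify({ type: "ping" }));
      }, 30_000);

      ws.addEventListener("message", (ev) => {
        const msg = JSON.parse(ev.data);
        if (msg.type === "pong") {
          missed = 0;              // any pong proves the connection is alive
        }
      });

      ws.addEventListener("close", () => clearInterval(timer));
    }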


> disconnect all long-lived TCP sockets in an availability zone in unison

I don't know what this means, but it sounds ridiculous. This would cause havoc with any sort of persistent tunnel or stateful connection, such as most database clients. Do you perhaps mean this just happens at ingress? That is much more believable and not as big of a deal.

> office networks of large customers would do the same thing.

Sounds like a personal problem. In all seriousness, your clients should handle any sort of network disconnect gracefully. It's foolish to assume TCP connections are durable, or to assume that you won't be hit by a thundering herd.

Maybe I'm old fashioned, but TCP hasn't changed much over the years; none of these problems are novel to me. It's well-trodden ground, and there are many simple techniques for building durable clients.

All of the things you mention also affect plain old HTTP, especially HTTP2. There shouldn't be a significant difference in how you treat them, other than the fact that you cannot assume they're all short-lived connections.


Most applications written using HTTP, in my experience, do not have deep dependencies on the longevity of the HTTP2 connection. In my experience, TCP connections for HTTP2 are typically terminated at your load balancer or similar. So reconnections here happen completely unseen by either the client application in the field or the servers where the business logic is.

For us -- and I think this is common -- the persistent WebSocket connection allowed a set of assumptions around the shared state of the client and server that would have to be re-negotiated when reconnecting. The fact that this renegotiation was non-trivial was a major driver in selecting WebSockets in the first place. With HTTP, regardless of HTTP2 or QUIC, your application protocol very much is set up to re-negotiate things on a per-request basis. And so the issues I list don't tend to affect HTTP-based applications.


> the persistent WebSocket connection allowed a set of assumptions around the shared state of the client and server that would have to be re-negotiated when reconnecting. The fact that this renegotiation was non-trivial was a major driver in selecting WebSockets in the first place. With HTTP, regardless of HTTP2 or QUIC, your application protocol very much is set up to re-negotiate things on a per-request basis. And so the issues I list don't tend to affect HTTP-based applications.

I think this describes a poor choice in technology. There's no silver bullet here, and it sounds like you made a lot of questionable tradeoffs. Assuming that "session" state persists beyond the lifetime of either the client or the server is generally problematic. It's always easier for one party to be stateless, but you can become stateful for the duration of the transaction.

Shared state is best used as communications optimization, and maybe sometimes useful for security reasons.


> Assuming that "session" state persists beyond the lifetime of either the client or the server is generally problematic.

I don't think you're interpreting the problem right? The state is tied to the connection, not outliving client or server. But it outlives single requests, and would be uncomfortably expensive to re-establish per request.


What I'm saying is that it's unrealistic to expect to hold a persistent TCP connection for an extended period of time across networking environments you do not control.

Making things not uncomfortably expensive is a good idea.

Relying on websockets to solve this for you is a mistake. It's convenient, but not robust. How would you solve it without websockets, using traditional HTTP? The same solution should be used with websockets, which then unlocks tremendous opportunities for optimization.


> How would you solve it without websockets using traditional HTTP?

You'd probably do the uncomfortably expensive setup, then give the client a token and store the settings in a database. And then do your best to cache it and have fast paths to reestablish from the cache on the same server or on different servers.

Not only could this add a lot of complication, now you've actually introduced the problem of state outliving your endpoints! You do unlock new ways to optimize, but you pay a high cost to get there. There's a very good chance this rearchitecture is a bad idea.


Sure. Look I'm not advocating for any particular solution here, just trying to point out the hopefully obvious fact that websockets are not a silver bullet. You've basically described why websockets unlock optimizations, which was my point.

Nothing in the GP's post is novel to websockets. Session based resource management is difficult, doubly so for long lived sessions. Relying on websockets to magically make that easy is foolish.

> Not only could this add a lot of complication, now you've actually introduced the problem of state outliving your endpoints!

I only want to point out that this is true with websockets as well, so I find this argument unconvincing. For websockets, what do you do when re-establishing a connection? You start anew or find the existing session. What if the client suddenly disappears without actively closing the connection? You have some sort of TTL before abandoning the session.

It's the exact same problem either way.


> Sounds like a personal problem. In all seriousness, your clients should handle any sort of network disconnect gracefully

That can be complex. Corporate MITM filtering boxes, "intrusion detection" appliances, firewalls, etc., can just decide to drop NAT entries, drop packets, break path MTU discovery, etc. Yes, there are things you can do. But then customers restart/reload when things don't happen instantly, etc. I don't know that there's a simple playbook.


None of this is particular to websockets, and in addition:

> you must employ people who are willing to become deeply knowledgeable in how TCP works

You already needed that for your HTTP based application; it's a fundamental of networked computing. Developers skipping out on mechanical sympathy are often duds, in my experience.


> employ people who are willing to become deeply knowledgeable in how TCP works

I used Microsoft's SignalR library. It knows TCP pretty well and handles most of the common pitfalls nearly automatically.

> customers to reconnect all at once.

That is definitely a problem. So we had to code it from the get go with the assumption that either the network will go down or the server will be bounced for an upgrade.

Actually, most of the issues I encountered had to do with various iPad versions going to sleep and then handling WebSockets in different ways once they woke up.


Hosted? How are your costs? I hear this catches people sometimes.

Any other advice for SignalR?


QUIC is the right idea. Encrypt everything including state. Kill middle boxes.

History has shown that if you allow middle boxes they will ruin everything.


> - Our well-known cloud hosting provider's networks would occasionally (a few times a year) disconnect all long-lived TCP sockets in an availability zone in unison. That is, an incident that had no SLA promise would cause a large swath of our customers to reconnect all at once.

I’m kind of surprised that it was that infrequent. I would expect software upgrades should cause long-lived sockets to reset…


or a scale-up of an ELB


> - Some customers had network equipment that capped the length of time that a TCP connection could remain open, interfering with the intended operation

What's the alternative that's going to work here?


I think this conflates a very specific paradigm for using WebSockets (state synchronization with a stateful "server") with WebSockets as a technology.

At the end of the day, WebSockets are just, well, sockets, with a goofy HTTP "upgrade" handshake and some framing on top of them. You could implement the exact same request/response model as in an HTTP based service over a WebSocket if you wanted to.

Stateful services are a tradeoff whether you use a WebSocket or not.

Reading through here, I think what you're trying to build is a synchronized stateful distributed system where state management becomes more transparent to the engineer, not only across backend services but also between the browser and the service side - this is well-trodden ground and a huge problem, but an interesting one to take on nonetheless. "WebSockets" are a red herring and just an implementation detail.


The article advocates that the client pull data via polling (as I understand it), ignoring over a decade of precedent for using websockets to push data to a client or broadcast it to many clients. It's up to the clients to consume that data and react to it if they even want to. There's nothing particularly bad about pub/sub there.

It also talks about a 'command pattern' at that point, which sounds like they're complaining about RPC really.


You're right. The key thing is the spectrum of freedom induced by the library/framework. For example, if you use something like PHP then you live in a prison of the request lifecycle. Sometimes the prison is socially enforced by not having shared state between requests within a process.

There is nothing special about websockets, but they do confer a freedom and responsibility.


It isn't uncommon for people to get confused about layering. In fact, it is particularly hard when you have a lot of people who have used something for a long time without actually understanding the design of it.

I'm not a fan of how websockets are implemented in browsers. It's a bit too "magical" for my taste. But what people seem to be complaining about is a mismatch between how they think networking works and how it actually works.

This really isn't any different from the kinds of problems I'd teach beginners to get around 25 years ago when dealing with ordinary TCP sockets. What has changed is that programmers generally tend to know a lot less about the underlying technology these days (because there is so much extra complexity to worry about).


I've only professionally used WebSockets with Spring Boot and React and I must say: they're perfectly fine?

That is, if you use WS to simply asynchronously communicate events and do out-of-band message passing, they operate quickly, easily and efficiently. I wouldn't use them to send binary blobs back and forth or rely on them to keep a perfect state match, but for notifications and push events they're a delight to work with. Plus, their long-term connectivity gives them an edge above plain HTTP because you can actually store a little state in sockets rather than deriving state from session cookies and the like.

Yes, WebSockets don't fit well within the traditional "one request, one response, one operation" workflow that the web was built upon, but that model is arguably one of the worst problems you encounter when you use HTTP for web applications (not websites, though; for websites, HTTP works perfectly!). Most backend frameworks have layers upon layers of processing and security mitigations exactly because HTTP has no inherent state.

WebSockets aren't some magical protocol that will make all of your problems go away but if used efficiently, they can be a huge benefit to many web applications. I've never used (or even heard of) Adama, so I can totally believe that websockets are a terrible match for whatever use case this language has, but that doesn't mean they deserve to get such a bad rep. You just have to be aware of your limitations when you use them, the same way you need to be aware of the limitations of HTTP.


Absolutely right, until you have multiple instances of backends you need to deal with and synchronize, which is a pain in the behind. The OP's main issue seems to be with that problem, which is the tough component of using something stateful like websockets anyway. The impedance mismatch of ‘something happened in the database’ to ‘send event over websocket’ is painful in a multi-backend-instance environment.

Case in point: dossier locking in the product we both worked on. Hi Jeroen! Always nice to find an old colleague on here :)


Ha, hey there! Nice username :)

You're totally right, of course; for shared state between different backend instances you'll need a different solution, like database locking or complicated inter-backend API calls, or even a separate (set of) microservice(s) to deal purely with websockets while other backends operate on the database. That way you can apply scaling without data consistency issues, if you want to go for a really (unnecessarily) complicated solution.

It all depends on your problem space. If you want to make a little icon go green on a forum because someone commented on your post, I think websockets are perfect, much better than the long-lived HTTP polls of yore.

I think the OP is using websockets to synchronize game state across different clients, which can be quite tricky even without having to deal with scaling or asynchronous connections. You can use websockets for that, but manual HTTP syncs/websocket reconnects after a period of radio silence would not go amiss. Hell, if it's real-time games this is about, you might even want a custom protocol on top of WebRTC to get complete control over data and state with much better performance.


A stream isn't "one request, one response", but you can still write your websocket endpoints to be stateless. Once you realize that, it's not so hard to work with.


For the very reasons listed in the article, I built:

https://github.com/siriusastrebe/jsynchronous

a library for keeping a javascript variables synchronized between Node.js servers and clients.

Websockets work great for message passing, but they struggle with data structures more complicated than what JSON can represent. Jsynchronous syncs any javascript object or array with arbitrarily deep nesting and full support for circular data structures.

If a computer goes to sleep, or disconnects, websocket connections (and their underlying TCP connections) get reset so you lose any data sent while a computer is unavailable. This is catastrophic for state-management if it's left unhandled. Jsynchronous will re-send any data clients are missing and reconstructs the shared state.

There's also a history mode that lets you rewind to past states.


Very cool. I tried something similar once. Have you gone down the differential/compressed update rabbit hole yet?


Right now jsynchronous passes messages using a custom encoding in JSON, which keeps simple changes as small as an HTTP header, but with byte-level encoding I think this could be halved.

Some compression on top would probably do wonders for huge volumes of changes, though more browsers are supporting compression on the websocket level.

I'm not familiar with the name differential update. Instead of passing potentially large states back and forth, Jsynchronous numbers each change to your synchronized data and shares these changes with all connected clients. This is called Event Sourcing, and it enables jsynchronous to rewind to previous states by running the changes from start to any intermediate state.
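
For illustration, a minimal sketch of that event-sourced approach (not jsynchronous's actual API or wire format; the names and shapes are made up):

    // Every change is numbered; clients apply changes in order, and a
    // reconnecting client asks for everything after the last seq it saw.
    interface Change {
      seq: number;                       // monotonically increasing
      path: string[];                    // where in the object tree to write
      value: unknown;                    // new value at that path
    }

    class SyncedState {
      private state: Record<string, unknown> = {};
      private log: Change[] = [];

      apply(change: Change): void {
        this.log.push(change);
        let node: any = this.state;
        for (const key of change.path.slice(0, -1)) {
          if (node[key] === undefined) node[key] = {};
          node = node[key];
        }
        node[change.path[change.path.length - 1]] = change.value;
      }

      // What a reconnecting client requests: all changes after `seq`.
      changesSince(seq: number): Change[] {
        return this.log.filter((c) => c.seq > seq);
      }

      // Rewind: rebuild state from the start up to a given sequence number.
      rewindTo(seq: number): Record<string, unknown> {
        const replay = new SyncedState();
        for (const c of this.log) if (c.seq <= seq) replay.apply(c);
        return replay.state;
      }
    }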


Very, very nice! I might use that! Just gotta wrap it in a nice React hook.


WebSockets are great when used in addition to polling. This way, you can design a system that doesn't result in missed events. Example: have a /events?fromTS=123 endpoint.

At FastComments - we do both. We use WS, and then poll the event log when required (like on reconnect, etc).
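
A sketch of that pattern, catch up over HTTP and then resume the push stream (the endpoint name, query parameter, and event shape are assumptions, not our actual API):

    interface LogEvent { ts: number; data: unknown }

    let lastSeenTs = 0;

    function handleEvent(ev: LogEvent): void {
      lastSeenTs = Math.max(lastSeenTs, ev.ts);
      // ...apply the event to local state...
    }

    async function catchUp(): Promise<void> {
      // On reconnect (or for integrations that only poll), fetch what was missed.
      const res = await fetch(`/events?fromTS=${lastSeenTs}`);
      const missed: LogEvent[] = await res.json();
      missed.forEach(handleEvent);
    }

    function connect(): void {
      const ws = new WebSocket("wss://example.com/live");
      ws.onopen = () => { void catchUp(); };            // fill the gap first
      ws.onmessage = (m) => handleEvent(JSON.parse(m.data));
      ws.onclose = () => setTimeout(connect, 1000);     // add backoff/jitter in practice
    }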

Products that can get away with just polling should. In a lot of scenarios you can just offload a lot of the work to companies like OneSignal or UrbanAirship, too.

If you're going to use WS and host the server yourself, make sure you have plans for being able to shard or scale it horizontally to handle herds.

It was hard for us to not use websockets, since like 70% of our customers pick us for being a "live" solution for live events etc.


> Example: have a /events?fromTS=123 endpoint.

This is what SSE (Server-Sent Events) is designed to do. A stripped-down version of websockets which:

- auto reconnects

- exposes an optional offset

- only allows server -> client streaming

SSE is also supported by most web browsers.
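
A minimal sketch of consuming SSE in the browser (the endpoint and payload are assumptions):

    const source = new EventSource("/events");

    source.onmessage = (ev: MessageEvent) => {
      // If the server tags events with an `id:` field, the browser resends the
      // last id in the Last-Event-ID header on reconnect so the server can
      // replay what the client missed.
      const update = JSON.parse(ev.data);
      console.log("update", update);
    };

    source.onerror = () => {
      // Fired on connection loss; EventSource handles the reconnect itself.
      console.warn("SSE connection lost, browser will retry");
    };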


Depending on what you are trying to achieve, I also recommend SSE over websockets, especially if all you want is to signal clients when state changes on the server.

SSE is a simple protocol that you can easily implement yourself both in the server and the client if the client lacks support.

SSE also fits naturally into existing request-response infrastructure if you only use SSE for notifications and keep everything else the same, i.e. use the same endpoints as before for fetching new data on an SSE notification event. It can thus be turned off just as easily if it becomes problematic, e.g. under too high a server load.


Yeah I agree. But even though SSE is super easy to grok and to implement (literally just standardized long polling), lots of existing infra builds on the assumptions that connections are short lived, so many of the WS issues apply to SSE as well.

IMHO, this unfortunate assumption is not really defensible in $current_year - especially from the multi billion dollar Cloud industry. I'd much more prefer first class support for long-lived connections on an infrastructure level, as opposed to a "proprietary database-level". I don't buy the argument that it's infeasible to solve the thundering herd issues.



I remember when I first heard about websockets, I was wondering what exactly it was useful for that SSE didn't already do. Almost all of the demos at the time were easier (IMO) to do with SSE. The two standards also both came from WHATWG at about the same time.

[edit]

I looked it up and SSE was a much earlier standard, but implementation of WS and SSE were relatively contemporary with the exception of Opera (had SSE in 2006) and IE (Never got SSE support).


I didn't know that SSE came first, thanks for adding that context.

It does feel like websockets tried to cram several novel features whereas SSE was simply giving proper clothing to the existing art of long polling.

In particular, WS is binary encoded and has support for multiplexing/message splitting and several optional HTTP headers, which in hindsight appear to have simply complicated the spec for little to no value.


SSE has a lot of restrictions that make it unattractive, like a global limit of 6 per browser session. This can cause confusing behavior for power users...


Hm yeah now that you mention it I recall that as well. Isn't that just an arbitrary crippling though? I can't imagine a good reason for why SSE would be hogging more resources than websockets.


Is it really worth the extra effort for WS over long polling at that point though? Especially if you're re-using the TCP connection it seems like the overhead would be minimal and the latency only slightly increased.


Sorry for the misunderstanding, but I don't mean WS over long polling. I mean WS in addition to polling, not long polling. Use websockets, but also expose an API to get the same events by specifying a timestamp. This way the websocket server implementation can be much simpler, and the client just has to call the API to "catch up" on missed events on reconnect.

You can also use this API for integrations, and your clients/consumers will thank you. For example, our third party integrations use the event log to sync back to their own data stores. They probably call this every hour, or once a day. You wouldn't want to use websockets with PHP apps like WordPress.


I can totally relate to that. When designing the RxDB GraphQL replication [1] protocol, it made things so much easier when the main data runs via normal request-response HTTP. Only the long-polling is switched out for WebSockets so that the client can know when data on the server has changed. This makes it really easy to implement the server-side components when you have a non-streaming database.

[1] https://rxdb.info/replication-graphql.html


You're right which is where I ended the conversation.

Ideally, you should be able to poll because it is resilient. The challenge however is when you separate the initial poll/pull from the update stream because now you have to maintain two code paths. What I'm proposing is that the poll and update stream use the same data format using patching.


It's one code path for publishing events. Devs don't know about both.

Overall we don't have much code since our websocket server is just nginx.


Why a separate poll instead of adding the initial offset to the websocket request URL or handshake? Just to be compatible with websocket-hostile networks?


To simplify the websocket server. It is simply a fan out mechanism, no app code lives there.


I’m confused. Just because you use request response doesn’t mean you don’t have state in your page. And sure, your connection can drop, but your requests can also fail. Protocol versioning is required, but that is true for any protocol ever, also request response. And what’s this about wanting to outsource all state and everything requires a database as soon as it has state? Which one does mspaint.exe use? And if using a load balancer or proxy somehow leaks through then something really messed up is going on. In pubsub, just like REST, one doesn’t care who sent the data. I’m confused.


Sorry about that. Yes, those are required but you have more freedom to go wrong in more severe ways.

Mspaint uses the file system at the command of the user. Something that I didn't mention (which I will in a revision) is that the server is being run by another person with their own deployment schedule. So, we generally like to excise state ASAP to a durable medium, since process state is volatile.


All of these issues could be solved well by a decent library. Or even just basic well thought out abstractions.

I've written a few before from "scratch" myself to decent success. I don't think there's anything inherent with sockets that make it impossible to use well. I mean, we've been using sockets since the beginning of the internet.

You can write the same article for literally any technology that's complicated to use without abstractions.


That's the core thing: the discovery of well-thought-out abstractions for a specific domain. I glossed over a lot of the specific details, but I listed the pitfalls that I have seen and how I'm thinking about the abstractions I'm using.


Every time I read criticism of WebSockets it reminds me about WebSuckets (https://speakerdeck.com/3rdeden/websuckets) presentation :)

I am the author of Centrifugo server (https://github.com/centrifugal/centrifugo) - where the main protocol is WebSocket. I agree with many points in the post – and if there is a chance to build something without replacing stateless HTTP with a persistent WebSocket (or EventSource, HTTP-streaming, raw TCP etc) – then it's definitely better to go without persistent connections.

But there are many tasks where WebSockets simply shine – by providing a better UX, more interactive content, and instant information/feedback. This is important to keep - even if the underlying stack is complicated. Not every system needs to scale to many machines (e.g. multiplayer games with a limited number of players), corporate apps don't really struggle with massive reconnect scenarios (since the number of concurrent users is pretty small), and so on. So WebSockets are definitely fine for certain scenarios IMO.

I described some problems with WebSockets Centrifugo solves in this blog post - https://centrifugal.dev/blog/2020/11/12/scaling-websocket. I don't want to say there are no problems, I want to say that WebSockets are fine in general and we can do some things to deal with things mentioned in the OP's post.


> Not every system needs to scale to many machines (e.g. multiplayer games with a limited number of players)

the author writes a websocket board game server. Most, if not all, of these complaints read like the author isn't partitioning the connections by game.


The question is at what level you partition the connection. I could set up a server and then vend an IP to the clients. The problem with that strategy is how do you do recovery? Particularly in an environment where you treat machines like cattle.

If you don't vend an IP, then you need to build a load balancer of sorts to sit between the client and the game instance server. Alas, how do you find that game instance server? With a direct mapping, a sticky header, or consistent routing. As long as you care about that server's state, it is the same as vending an IP to the client, except you can now absorb DOS attacks and offload a bit of compute (like auth) to the load balancer fleet.

The hard problem is how much you care about that server's lifetime. Well, we shouldn't, because individual servers are cattle, and we can solve some of the cattle problems by having a graceful shutdown that migrates state and traffic away. This helps with operator-induced events, which can be handled with kindness. Machine failures, kernel upgrades, and other such things that affect the host may have other opinions.


My advice:

• Make push updates optional. If no push connection can be established, fall back to polling. You can start by implementing polling only and add push later.

• Use websockets only for server-to-client communication. Messages from the client are sent via regular HTTP requests.

• Keep no meaningful state on the server. That includes TCP connection state. You should be able to kill all your ec2 instances and re-spawn them without interrupting service.

• Use request/response for all logic on the server. All your code should be able to run in an AWS lambda.

• Use a channel/subscription paradigm so clients can connect to the streams they're interested in

• Instead of rolling your own websocket server, use a hosted service like pusher.com or ably.com. They do all the heavy lifting for you (like keeping thousands of TCP connections open) and provide a request/response style interface for your server to send messages to connected clients


What are the downsides? Are these hosted services just AWS?


Full disclosure I work at MSFT and on the fluid framework.

If you are interested in this you may also be interested in the fluid framework, https://github.com/microsoft/FluidFramework

We use websockets and solve a lot of the state management problems called out here by keeping very little state on the server itself. The primary thing on the server is a monotonically increasing integer we use to stamp messages; this gives us total order broadcast, which we then build upon: https://en.m.wikipedia.org/wiki/Atomic_broadcast

Here are some code pointers if you want to take a look:

The map package is a decent place to look for how we leverage total order broadcast to keep clients in sync in our distributed data structures: https://github.com/microsoft/FluidFramework/blob/main/packag...

The delta manager in the container-loader package is where we manage the websocket. It also hits storage to give the rest of the system a continuous, ordered stream of events:

https://github.com/microsoft/FluidFramework/blob/main/packag...

The main server logic is in the Alfred and Deli lambdas. Alfred sits on the socket and dumps messages into Kafka. Deli sits on the Kafka queue, stamps messages, then puts them on another queue for Alfred to broadcast: https://github.com/microsoft/FluidFramework/tree/main/server...


come to phoenix/elixir land. Channels are amazing and the new views system allows you to seamlessly sync state to a frontend using websockets with almost no javascript


Perhaps I will.

I'm thinking about how I position Adama for both Jamstack and as a reactive data-store (which could feed phoenix/elixir land). I intend to change my marketing away from the "programming language" aspect and more towards "reactive data store".


My side-project relies heavily on WS, I'm currently using Node.js and it's alright but learning Elixir is my goal during these holidays. Any resource to share to get started? I don't know much about Elixir except it's perfect for such use-cases.


I was a nodejs programmer for almost 8 years before I hopped to elixir. A lot of it was motivated by elixir's realtime system and concurrency.

One of the issues with nodejs is that you're stuck with one process which is single threaded, i.e. you're stuck to one core. Yes, there are systems like clustering which rely on a master process spawning slave processes and communicating over a bridge, but in my experience it's pretty janky.

You're making a great move by trying out elixir. It solves a lot of the issues I ran into working with nodejs. Immutability is standard and if you compare two maps, it automatically does a deep evaluation of all the values in each tree.

The killer feature however is LiveView. It's what Meteor WISHES it could be: realtime server push of HTML to the DOM and the ability to have the frontend trigger events on the backend in a process that's isolated to that specific user. It's a game changer.

Anyways, if you're looking for resources to learn, pragprog has a bunch of great books on elixir. That's how I got started.


I got pretty far in the beginning just by reading the docs on Phoenix channels [0]. I was learning Phoenix and ReactJS at the same time and got a pretty simple Redux thunk that interacted with Phoenix Channels in a few days. I'm not sure if that was the optimal way to do it, but it was really cool interacting with the application from IEx (elixir shell).

You might find an interesting (albeit more complex) entry point by interacting with Phoenix Channels from your Node.js app using the Channels Node.js client [1] and within the frontend itself.

[0]: https://hexdocs.pm/phoenix/channels.html [1]: https://www.npmjs.com/package/phoenix-channels


I wrote Real-Time Phoenix and it goes into pretty much everything that I hit when shipping decently large real-time Channel application into production. Like, the basics of "I don't know what a Channel is" into the nuance of how it's a PITA to deploy WS-based applications due to load balancers and long-lived connections.

The new LiveView book is great if you're interested in going fully into Elixir (server/client basically). I use LiveView for my product and it's great.


thanks for the excellent read. it was one of the pdfs sitting on my desktop when I was designing our event system at my startup.


The book Real-Time Phoenix is good for designing realtime systems.

For a good introduction to Elixir, I loved Elixir in Action which also covers OTP.

If you want a good Phoenix resource, the book Programming Phoenix is good.

Finally, the docs in Elixir and Phoenix are great.


Being already familiar with FP I was able to jump right into elixir after running through learn you some erlang. Erlang and elixir are very similar, and you get the bonus of learning a bit about OTP, BEAM, and the underlying philosophy. There's a ton of resources after that, and the docs are easy to use and understand.

https://learnyousomeerlang.com/


Channels are basically websockets and only supported inside Phoenix as far as I can tell.


It's more than just websockets. It's a protocol built on websockets that also includes keepalive and a long-polling fallback.

You could in theory replicate the protocol in another stack but its more than that. Elixir is uniquely suited to websocket applications. Since a websocket is persistent, you need a process on your end that handles that. Elixir is excellent at creating lightweight threads for managing these connections. Out of the box, you can easily support a few thousand connections on a single server.

I know because we did it at my startup. Channels power our entire realtime sync system and they have yet to be a bottleneck. It more or less works out of the box without issue. It's almost boringly reliable.


Minor nitpick: Channels are transport-agnostic, you could always do them over longpoll, it's not hard to impl a raw tcp socket channel driver, and hell you could probably figure out how to do it in streaming HTTP 1.1. I might try to do channels over WebRTC.


Elixir inherits those properties from the Beam VM it shares with Erlang - both languages are effectively great for this WS/Channels use case - but Channels seems unhelpful unless you're already using phoenix. I can't see that it can be used in say Erlang.


Channels are really just a protocol, but that protocol is implemented in Elixir (Phoenix) and so isn't available elsewhere.

I think that the important question is "why does this protocol exist?" Most likely, you'll end up solving similar problems as to why Channels exist in the first place. So from a protocol perspective, it's nice that some problems are solved for you.


May not fit your use-case, but you can create an umbrella project with both an Erlang app and an Elixir/Phoenix app, whereby the latter's capable of calling functions in the former.


We used WebSockets to build a web-based front end for Ardour, a native cross-platform DAW, and didn't encounter any of these issues. Part of that is because the protocol was already defined (Open Sound Control aka OSC) and used over non-web-sockets already. But as others have noted, most of the problems cited in TFA come from the design goals, not the use of websockets.


The answer is simple: abstraction, or the lack thereof. Building real-time web applications is difficult. But the difficulty is accidental complexity that could be abstracted away. However, most of the tech stacks are not there yet - mainly because the request-response model is good enough for most sites, so the industry won't push it forward wholeheartedly. Maybe the situation will start to change after WASM takes off, or maybe not.

The best bet is to work on platforms that already have great WebSocket support. PHP might not be a great choice. Node.js is okayish but not great. Blazor sounds interesting but I'm not sure about the performance. Elixir/Phoenix is probably the best bet for now. It has nice, abstracted APIs, it's performant, and it can form clusters either by itself or with Redis.

It's very hard to scale WebSocket because of its stateful nature, so please look for platforms that have already solved this problem.


You can turn websockets into a flawless request/response with async/await included in like 20 lines of JavaScript. I do it all the time.

Generate an ID, make a request, store the promise resolve/reject in a map (js object). Your onmessage handler looks up the promise based on the ID and resolves the promise.
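
A rough sketch of that pattern (the message shape and id generation are illustrative, not a specific library's API):

    // Request/response over a WebSocket using a promise map keyed by id.
    const pending = new Map<string, { resolve: (v: unknown) => void; reject: (e: Error) => void }>();

    function request(ws: WebSocket, type: string, payload: unknown, timeoutMs = 5000): Promise<unknown> {
      const id = crypto.randomUUID();          // any unique id works
      return new Promise((resolve, reject) => {
        pending.set(id, { resolve, reject });
        ws.send(JSON.stringify({ id, type, payload }));
        // If no reply arrives in time, fail the request (and maybe reap the socket).
        setTimeout(() => {
          if (pending.delete(id)) reject(new Error(`request ${type} timed out`));
        }, timeoutMs);
      });
    }

    function handleMessage(ev: MessageEvent): void {
      const msg = JSON.parse(ev.data);
      const entry = pending.get(msg.id);
      if (!entry) return;                       // no pending id: treat as a server push
      pending.delete(msg.id);
      if (msg.error) entry.reject(new Error(msg.error));
      else entry.resolve(msg.payload);
    }

    // Usage: ws.onmessage = handleMessage; const user = await request(ws, "getUser", { id: 42 });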

Add a few more tiny features like messaging and broadcasting streams and you've got both request/response and push messaging over a single websocket.

It's pretty neat in my opinion, saves having to mix HTTP and websockets for a lot of things.


It is neat. I've gone down this path[0]. But you find yourself essentially re-implementing HTTP, and losing things like backpressure. Sometimes it's what you need, but I try to be cautious before jumping to WS these days.

[0]: https://iobio.io/2019/06/12/introducing-fibridge/


I tried to make something with websockets a few weeks ago. My backend was a simple Python3 web server and I just didn’t have the chops to make it work alongside the asyncio websockets stuff.

The alternate solution which worked exceptionally well was to make an XHR which the server would hold open and only respond to once an event occurred. The moment the XHR response arrives, another one is set up to wait for the next event.

I made a proof of concept that sent my keystrokes from the server to the browser. It was so snappy it was like typing locally.


I kludged together subscription support for our GraphQL stack, and it's super scary because the product people see it working and are like, "this is awesome!" but sometimes it doesn't totally work, and I don't think they realize how hard it would be to, "just make it work all the time".

I want a "real" non-kludge solution, but it's hard to convince someone they need to give you money to throw away something that "works", from their perspective.


A few years ago I was more inclined to use WebSockets. They're undeniably cool. But as implemented in browsers (thanks to the asynchronous nature of JavaScript) they offer no mechanism for backpressure, and it's pretty trivial to freeze both Chrome and Firefox sending in a loop if you have a fast upload connection.

I designed a small protocol[0] to solve this (and a few other handy features) which we use at work[1]. A more robust option to solve similar problems is RSocket[3].

More recently I've been working on a reverse proxy[2], and realized how much of a special case WebSockets is to implement. Maybe I'm just lazy and don't want to implement WS in boringproxy, but these days I advocate using plain HTTP whenever you can get away with it. Server Sent Events on HTTP/1.1 is hamstrung by the browser connection limit, but HTTP/2 solves this, and HTTP/3 solves HTTP/2's head of line blocking problems.

Also, as mentioned in the article, I try to prefer polling. This was discussed recently on HN[4].

[0]: https://github.com/omnistreams

[1]: https://iobio.io/2019/06/12/introducing-fibridge/

[2]: https://boringproxy.io/

[3]: https://rsocket.io/

[4]: https://news.ycombinator.com/item?id=27823109


I've never seen websockets accomplish anything that push with long polling didn't do more effectively and efficiently. I think they are a technology that missed its time; the CPU / power savings are generally non-existent and the minimal bandwidth savings are frequently negated by the need to add in redundancy and checks.


If SSE supported binary data I might agree with you. I think WebSockets have their place, but are overused. I definitely agree long polling should be implemented first and then only add WS if you've measured that you need it. You're probably going to need to implement polling anyway for clients to catch up after disconnect.


Using Blazor (.net). It's been mostly a good experience. Fast UI updates, good programming model.


I've been looking at it lately and the SignalR tech it uses. Very nice. My cursory Google research indicates a server using it can handle about 3000 users, which is not many, but for my purposes is fine. Blazor makes it completely optional, which I proved to myself by doing client side Blazor (dot net wasm) only and using an Apache server instead of IIS with ASP.NET.

https://en.wikipedia.org/wiki/SignalR


Right, with .net core 6 the runtime wasm is much smaller. SignalR can be scaled if you need to, but for most sites 3000 concurrent users is plenty.


We're migrating to server-side Blazor because of the good experience yet all of the things mentioned in this article are a concern for using this technology. A few things, especially around deployment and maintenance, are significantly less good when your clients are always connected. I'm currently trying to figure out mitigations.


Worth looking at Elixir Phoenix Channels if you’re wanting to use sockets. Handle a lot of the messy stuff for you.

https://hexdocs.pm/phoenix/channels.html


I use WebSockets for robots to communicate real-time state with a web UI. They’re the only option and they’re fine.

I get that this article is more focused on a narrow perspective of the technology. But of course you can make silly use of any technology.


I also implemented MQTT for industrial machines to publish data to a broker. It was trivial to create a web UI that subscribed to that broker via MQTT over Websockets. But I noticed that colleagues had this impression that MQTT cannot be used on the web, so they wanted to build a conversion from MQTT to SignalR.. a quick search would've cleared it up, but they were so sure for some reason. After I showed them the demo they just went with MQTT as well.


meh. we shipped an interactive ETL tool to a bunch of government customers and it relies heavily on the channel abstraction and pubsub provided by phoenix and elixir. maybe it's because message passing is the programming model all the way down the stack and all the way across the cluster, but long lived stateful connections are just not hard to deal with.

elixir has (plenty of) its own drawbacks, but wrangling a pile of topics over a websocket is not one of them. this can be a solved problem if you make some tradeoffs.

this system isn't huge, but has been running for years for plenty of customers irl.


Author: in this doc, headers are paragraph topic sentences, not bookmarks for disjoint ideas.

And they are nonsensical. "Problem: Bumps in the night#". Oh, thanks, let me bookmark that for easy sharing.


I'm using markdown and https://docusaurus.io/

And, yes, please bookmark for easy sharing.


Make sure you handle disconnects/reconnects, leverage query params to pass state (ideally limited), sync/scale via Redis pubsub, and assume that the websocket will fail, so always have some sort of fallback/redundancy. Ideally, think of websockets like sprinkles on ice cream.

Shameless plug: https://cheatcode.co/courses/how-to-implement-real-time-data...


I read up to 5-6 “problems” and still don’t understand the exact setting they are talking about. All websocket services I have used alleviated tcp issues by some sort of a heartbeat or planned close/reconnect (just like long-polling, which is basically the same thing from tcp perspective). State sync is not a problem if you don’t lose any state on server restarts, which you shouldn’t lose anyway. Moreover, you don’t have to have a conversation over a socket, request can be made over a regular xhr, only realtime events ({type:event, msg:{id, status}}) go back via socket, iff the client is interested in these. If a socket fails, they may reload the page and get this info via xhr again (/api/get-status?id=). The queue is usually natural, the messages are usually realtime.

> the state on that machine must survive the failure modes of the proxy talking to it. That is, the state must be found. … However, this creates a debt such that a catastrophic socket loss creates a tremendous reconnect pressure.

It is terrifying indeed, but boils down to “don’t try to resend the past”, no FUD required. The state of a socket is the same as of an http request: they connect or auth with some id/jwt and then a socket can send events back again. If you can’t hold that reconnect+auth pressure, you likely can’t hold long-poll reconnect+auth pressure, and less likely but still probably you can’t serve clients who refresh their pages aggressively. Put an adaptive rate limiter before your routes already.

This is a talk with strange assumptions and without technical details; what is their point beyond selling some Adama? Have we lived to the point where people are afraid of streaming in favor of repeated polling?


There is this particular app I have to use for work which relies heavily on web sockets. It works great so long as your latency is good. But as soon as your latency is over some threshold the sheer volume of websocket requests compounds and it becomes unusable. Normal sites just load slower, they don't fail entirely like this.

I guess it goes without saying but devs really need to test their apps under unfavorable conditions.


You don’t have to design a new protocol. There is an open protocol which works nice for this use case. It can handle dropping connections, has a nice routing mechanism, it scales pretty well, has different QoS levels, pub/sub, request/response. That protocol is... MQTT v5!

Personally I’m surprised it is not used more often by web devs.


MQTT v5 is fine enough until you have to care about reliability at scale. What I have seen happen is that it turns into a slightly better TCP where you have to build your own primitives.

For example, when you SUB, you get a SUBACK if enabled. However, there is nothing like a SUBFIN (subscribe over/finished/closed) to indicate that the subscription is over or failed. Pub/sub overcommits in many ways. Now, this all depends on what SUBACK means: is it durable? Tied to a connection?

An interesting foil is something like rsocket. RSocket has a lot of nice things, but it requires an implementation to be damn near perfect. Paradoxically, rsocket lacks the notion of a SUBACK but it has something like a SUBFIN.

A key challenge with a stream is knowing if it is working. With request-response, you can always time out. With a stream, how does one tell the difference between a broken stream and an inactive stream?


We use this at my place of employment and it’s generally OK except for dealing with reconnects. Also pub/sub request/response was not nearly as smooth as we had hoped and eventually we reverted back to HTTP


We have built a nice support library for it: https://github.com/cotonic/cotonic. It helps that our backend is Erlang: https://github.com/zotonic/zotonic.


Websockets:

We wanted a connection-oriented streaming protocol API for web browsers, like TCP, but we didn't want to just let web sites make random connections to TCP ports, so instead we took a connectionless protocol and bolted on a connection-oriented streaming protocol that can only use 2 port numbers.

We live in the future now.


The biggest challenge when migrating from HTTP to any kind of stream is the loss of callback on response, because there is no request/response. There is only push. That means your messaging and message handling becomes far more complex to compensate.

The article mentions that your socket could drop. This isn’t critical so long as you have HTTP as a redundancy.

For any kind of application making use of micro services I strongly recommend migrating to a socket first transmission scheme because it is an insane performance improvement. Also it means the added complexity to handle message management makes your application more durable in the face of any type of transmission instability. The durability means that your application is less error prone AND it is ready for migration to other protocols in the future with almost no refactoring.


A team at work made the mistake of using WebSockets for a new application instead of old fashioned long polling. That resulted in 3+ months of hell debugging a buggy OSS library and fixing incompatibility issues for different mobile devices. And it broke again when a new OS update was released for iPhone. I did warn them not to use fairly new tech when old battle proven tech would solve the problem equally well. But the shiny new tech was too tempting for them and they paid the price for it. Which yet again confirms my #1 rule of thumb: always use tech that is old and boring unless there is no old tech that can do the job. If you want excitement, perhaps instead do kick boxing. That worked for me.


Websockets are old and boring. They have been around for nearly a decade.


Apparently not old and boring enough.


Maybe old and boring is not the discriminating factor you thought it was, then?


I definitely think it is. However “old” and “boring” are not precise terms. Perhaps a more accurate way to define the rule would be: “Prefer the oldest, battle proven, known to the team tech that can do the job.” For example if long polling and WebSockets are both capable of doing what you need to get done then pick long polling. Because it is older, more used, proven, used by many web applications out there, compatible with everything on the planet, and known to the team.


Just use Meteor and/or Apollo. They have all this stuff figured out.

https://www.meteor.com/

https://www.apollographql.com/


having written a websocket game server that serves hundreds of thousands of simultaneous connections, I find the tone of this article really frustrating, because the author is peddling a lot of FUD about websockets while at the same time building a SaaS websocket server product. Websockets involve different challenges than traditional http servers, but I've always found them to be fun and stimulating.


If you’re developing a game you’re probably not dealing with enterprise networks. Enterprise networks have their own rules, only loosely related to the specs.


I wouldn't call it FUD, but lived experience manifesting as requirements and knowing what to avoid. I've been the guy that has to debug reliability issues in streaming services because people just wing it.

I agree that it is fun and stimulating, but you also have to pick your battles and know which problems are worth having.


For all the mean things I could say about the language and its ecosystem, this is one area where elixir/phoenix/channels shine.


Curious, what are the mean things? Feel free to email them if you don't want a big thread.


I made a (bad) habit of taking all the HN threads about people being in love with Elixir, and being a Grinch ;)

Mostly it's stuff that we did wrong, did not understand, a poor choice of lib, and a strong preference for statically typed languages with a visual debugger that just works.

Nothing that should prevent you from using or trying it - it's Christmas time, peace on flame wars.


Why use a websocket on CP systems where consistency is important? Request/Response seems better?

Even though they CAN be used for CP I always thought Web Sockets were targeted more for AP systems where Availability/Speed was what's more important and if the connection broke and state was lost it wasn't a big deal.


A websocket doesn’t have much import on CP vs AP. Request/Response is still done over unreliable connections that can’t guarantee exactly once request or response delivery without further application-level mechanisms (retries with idempotency keys, etc.). They’re also used mostly with web clients, and it’s pretty rare that we consider systems where web client partition is allowed to reduce availability for the system as a whole, even if it is CP.


Ah that makes sense. Maybe I need to read the article again but it was talking about lost state in the web client due to the web socket which means they're trying to treat the web client as a stateful partition?

So I guess my instinct was why even have state in the browser.


To reduce the problem of updating the software you can bring up a second sender and make all new connections point to that one, but don't close the old ones.

If your user sessions are shorter than the time between deployments, then you won't close any connections.


When should you use Web Sockets? I've been interested in the tech but haven't found a use case that needs it. I can see a scenario of live, server pushed updates being a use case but am completely ignorant of the issues the article speaks to.


Not really related, but it's funny that this and the websocketd post were right next to each other:

https://imgur.com/a/sbHddsd


On the plus side, once you get a good websocket steady state going, the efficiency and latency benefits are nice compared to HTTP1 polling.

Maybe not that big of an advantage anymore with HTTP2 popping off.


Archive link with https https://archive.li/R8Nxi


SSE is underrated; for example, it's stateless and easy to cluster but has many of the benefits of websockets.


Cometstream is much better in every way.


Just in case the author is listening, I'll play the grammar nazi: it's "woe unto," not "woe onto."


I was just wondering if anyone had pointed that out. Fixed above.

Edit: it probably should be "woe unto". From the King James Bible (surely the canonical source of woe-untos):

  54  woe unto
  24  woe to
   2  woe be unto
   1  woe be to
but "woe be unto" is in there so I guess it's legit enough.


They went for "woe be unto"... Guess they wanted a less common variant :>


I'll fix when I return to my computer. Thanks


Or, something like "Woe betide all who use WebSockets."



