Jetstream: Shrinking the AT Protocol Firehose by >99% (jazco.dev)
163 points by keybits 9 days ago | 103 comments





Why provide a non-compressed version at all? This is a new protocol, so there's no need for backwards compatibility. The dictionary could be baked into the protocol itself, fixed for a specific version, e.g. protocol v1 uses a fixed v1 dictionary. That would also be useful for replaying stored events on both sides.

Jetstream isn't an official change to the Protocol, it's an optimization I made for my own services that I realized a lot of other devs would appreciate. The major driving force behind it was both the bandwidth savings but also making the Firehose a lot easier to use for devs that aren't familiar with AT Proto and MSTs. Jetstream is a much more approachable way for people to dip their toe into my favorite part of AT Proto: the public event stream.

As I understand the article, there are new APIs (unofficial, for now):

- Jetstream(1) (no compression) and

- Jetstream(2) (zstd compression).

And my comment means (1) is not really needed, except in some specific scenarios.


It's impossible to use the compressed version of the stream without using a client that has the baked-in ZSTD dictionary. This is a usability issue for folks using languages without a Jetstream client who just want to consume the websocket as JSON. It also makes things like using websocat and unix pipes to build some kind of automation a lot harder (though probably not impossible).
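With the uncompressed stream, by contrast, something like this works with nothing but off-the-shelf tools (the host here is a placeholder; wantedCollections narrows the stream to one record type):

  websocat 'wss://jetstream.example.com/subscribe?wantedCollections=app.bsky.feed.post' \
    | jq -r '.commit.record.text? // empty'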

FWIW the default mode is uncompressed unless the client explicitly requests compression with a custom header. I tried using per-message-deflate but the support for it in the websocket libraries I was using was very poor and it has the same problem as streaming compression in terms of CPU usage on the Jetstream server.


> It also makes things like using websocat and unix pipes to build some kind of automation a lot harder

Would anybody realistically be using those tools with this volume of data, for anything but testing?


It's not that much data. Certainly nothing zstdcat can't handle.

A non-compressed version is almost certainly cheaper for anything local (e.g. self-hosting your own services that consume the firehose on the same machine, or for testing).

There's not really a good reason to do compression if the stream is just going to be consumed locally. Instead you can skip that step and broadcast over memory to the other local services.
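For the local case, a minimal sketch (assuming a Jetstream instance on localhost; the port and file names are placeholders), consuming the plain stream and optionally archiving it with zstd at rest for later replay:

  # live consumption, no compression involved
  websocat 'ws://localhost:6008/subscribe' | jq -r '.did'
  # or archive the raw stream compressed at rest, and replay it later
  websocat 'ws://localhost:6008/subscribe' | zstd -q > events.jsonl.zst
  zstdcat events.jsonl.zst | jq -r '.did'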


It could be a flag, normally disabled. Also, I'm not sure about the "cheaper" side, since disk ops are not free; maybe decompressing zstd IS cheaper than writing and reading huge blobs from disk and exchanging data between apps.

This isn’t compression; they are throwing features of the original stream out.

We're discussing here the compiled output, plain JSON, and the same JSON but zstd-compressed.

I gotta say, I am not very excited about "let's throw away all the security properties for performance!" (and also "CBOR is too hard!")

If everyone is on one server (remains to be seen), and all the bots blindly trust it because they are cheap and lazy, what the hell is the point?


Centralization on trusted servers is going to happen, but if they speak a common protocol, at least they can be swapped out. For Jetstream, anyone can run an instance, though it will cost them more.

It’s sort of like the right to fork in Open Source; it doesn’t mean people fork all the time or verify every line of code themselves. There’s still trust involved.

I wonder if some security features could be added back, though?


If you’re going to try data reduction and compression, always try compression first. It may reveal that the 10x reduction you were looking at is only 2x and not worth the trouble.
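(The cheap way to run that check is to compress a captured sample both ways and compare sizes; the file name here is a placeholder:

  gzip -9k sample.json
  zstd -19 -k sample.json
  ls -l sample.json*   # original vs gzip vs zstd sizes
)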

Reduction first may show that the compression is less useful. Verbose, human-friendly protocols that are then compressed win out in maintenance tasks, and it's a marathon, not a sprint.


As a corollary, if you try to be too clever with your data reduction strategy, you might walk yourself into a dead end / local maximum by making the job of off-the-shelf compression algorithms more difficult.

> If everyone is on one server (remains to be seen)

Even internally, BlueSky runs a number of "servers." It's already internally federated. And you can transfer your account to your own server if you want to, though that is very much still beta quality, to be fair.

You're not really "on a server" in the same sense as other things. It's closer to "I create content addressable storage" than "The database on this instance knows my username/password."


And if you're curious what your Bluesky server is:

  DID=$(curl -s https://<username>.bsky.social/.well-known/atproto-did)
  curl -s https://plc.directory/$DID | jq -r '.service[0].serviceEndpoint'
Or if you're using a custom domain (a la example.com), you can get your DID from the DNS:

  dig +short _atproto.example.com TXT
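And to turn that TXT record (whose value looks like "did=did:plc:...") into the PDS endpoint:

  DID=$(dig +short _atproto.example.com TXT | tr -d '"' | sed 's/^did=//')
  curl -s https://plc.directory/$DID | jq -r '.service[0].serviceEndpoint'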

Thanks for this! I'd never checked. Turns out I'm on https://morel.us-east.host.bsky.network. I do want to host my own PDS someday, but then I'd be responsible for keeping it up...

Indie hosted PDS is on my list as well.

I'm also currently trying to understand the tradeoffs for did:plc more. It's unclear to me just how centralized it is. Will it always require a single central directory, or is it more like Certificate Transparency? Based on what I've heard about the recovery process, I believe it's the latter, but I still need to dig into it more.
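One concrete thing I want to try is pulling the raw operation log for my DID, which (if I understand the design) is what an external auditor would use to check for history rewrites. Going from memory on the exact path, it's something like:

  curl -s https://plc.directory/$DID/log/audit | jq .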


I don't know myself, but given that the discussion there, from what I've heard, is along the lines of "should be moved into an independent foundation," my assumption is that it will always require a directory. But this is probably the part of the tech stack that I know the least details about.

We've seen this over and over. If you do things the "right" way devs just don't show up because it's too much work.

Depends on what the right way is to be honest. If the "right way" is moving the project into zero-knowledge proofs territory, it does push out a lot of the developers. It is not that cryptography in general is pushing people out in that case but ZKP complexity is.

Or why can't one verify a message on its own, isolated from all the other events on the PDS?

The full Firehose provides two major verification features. First, it includes a signature that can be validated, letting you know the updates are signed by the repo owner. Second, by providing the MST proof, it makes it hard or impossible for the repo owner to omit changes to the repo contents from the Firehose events. If some records are created or deleted without emitting events, the next event emitted will show that something's not right and that you should re-sync your copy of the repo to understand what changed.

Nice feat!

I wonder if a rewrite of this in C++ would bump the performance even further and optimise the overall system.


Or in rust haha

The "bring it all home" screenshot shows a CPU Utilization graph, and the units of measurements on the vertical axis appears to be milliseconds. Could someone help me understand what that measurement might be?

The graph is labeled: CPU seconds per second, i.e. how much CPU time the process consumes per wall-clock second, so 1.0 would be one fully busy core.

Missed that, thank you.

>Before this new surge in activity, the firehose would produce around 24 GB/day of traffic. After the surge, this volume jumped to over 232 GB/day!

>Jetstream is a streaming service that consumes an AT Proto com.atproto.sync.subscribeRepos stream and converts it into lightweight, friendly JSON.

So let me get this straight: if you did want to run Jetstream yourself, you'd still need to be able to handle the 232 GB/day of bandwidth?

This has always been my issue with Bluesky/AT Protocol. For all the talk about their protocol being federated, it really doesn't seem realistic for anyone to run any of the infrastructure themselves. You're always going to be reliant on a big player that has the capital to keep everything running smoothly. At this point I don't really see how it's any different than being on any of the old centralized social media.


> It really doesn't seem realistic for anyone to run any of the infrastructure themselves. You're always going to be reliant on a big player that has the capital to keep everything running smoothly

I run a custom rust-based firehose consumer on the main firehose using a cheap digitalocean droplet and don't cross 15% CPU usage even during the peak 33 Mb/s bandwidth described in this article.

The core team seems to put a lot of thought into making the different parts of the network independently hostable. The most resource-intensive of these is the relay, which produces the event stream that Jetstream and others consume. One of their devs did a breakdown of that and showed you could run it at $150 per month, which is pricey but not unattainable with grants or crowdfunding: https://whtwnd.com/bnewbold.net/entries/Notes%20on%20Running...


I can attest secondhand that it's possible to run a relay for ~€75/month, which is well within the range of many hobbyists.

Old social media never gave full access to the firehose so there’s a pretty big difference.

If you want large scale social networks, you need to work with a large scale of data. Since federated open queries aren’t feasible, you need big machines.

If you want a smaller scale view of the network, do a crawl of a subset of the users. That’s a perfectly valid usage of atproto, and is how ActivityPub works by nature.


>Old social media never gave full access to the firehose so there’s a pretty big difference.

That is good, but it's still a centralized source of truth.

>If you want large scale social networks, you need to work with a large scale of data. Since federated open queries aren’t feasible, you need big machines.

That's simply not true. ActivityPub does perfectly well without any bulky machine or node acting as a relay for the rest of the network. Every ActivityPub service only ever interacts with other discovered services. Messages aren't broadcast through a central firehose; they're sent directly to whoever needs to receive them. This is a fundamental difference in how the two protocols work. With ATProto you NEED to connect to some centralized relay that will broker your messages for you. With ActivityPub there is no middleman; instances just talk directly to each other. This is why ActivityPub has a discovery problem, by the way, but that's just a symptom of real federation.

>and is how ActivityPub works by nature.

It's not. See above.


> That is good, but it's still a centralized source of truth.

It's not. It's a trustless aggregator. The PDSes are the sources of truth, and you can crawl them directly. The relay is just an optimization.

> Messages aren't broadcast through a central firehose

ATProto works like the web does. People publish information on their servers, and then relays crawl them and emit their crawl through a firehose.

> ActivityPub does perfectly without the need of any bulky machine or node acting as a relay for the rest of the network

ActivityPub doesn't do large-scale aggregated views of the activity. The pairwise exchanges mean that views get localized; this is why there's no network-wide search, metrics, or algorithms.

> This is why ActivityPub has a discovery problem by the way,

right

> but it's just a symptom of real federation.

"real" ?


>The PDSes are the sources of truth, and you can crawl them directly. The relay is just an optimization.

This is such a massive understatement. The relay is the single most important piece in the entire Bluesky stack.

Let me ask you this: is it possible for me to connect to a PDS directly, right now, via the Bluesky app? Or is this something that will be possible in the future?

>ATProto works like the web does. People publish information on their servers, and then relays crawl them and emit their crawl through a firehose.

>ActivityPub doesn't do large scale aggregated views of the activity.

So are relays really just an optimization or an integral part of how ATProtocol is supposed to work? ActivityPub doesn't require relays to function properly. This is why I say it's real federation. You can't truly be federated if you require centralization.


> Let me ask you this, is it possible for me to connect to a PDS directly, right now, via the bluesky app?

Well, yes, that's what you do when you log in. If you open devtools you'll see that the app is communicating with your PDS.

> So are relays really just an optimization or an integral part of how ATProtocol is supposed to work?

I think the issue here is that you're mentally slicing the stack in a different way than atproto does. You expect each node to be a full instance of the application, and the network to be partitioned among a bunch of applications exchanging pairwise.

A better mental model of atproto is that it's a network of cross-org microservices. https://atproto.com/articles/atproto-for-distsys-engineers gives a decent intuition about it.


> If you open devtools you'll see that the app is communicating with your PDS.

That's pretty cool. Does bsky.app do a DID resolve or does it have a faster back channel for determining PDS addresses for bsky.social?


> You can't truly be federated if you require centralization.

I’m not so sure: isn’t the certificate transparency log a pretty good example of a federated group of disparate members successfully sharing a view of the world?

That requires some form of centralization to be useful (else it’s not really a log, more of a series of disconnected scribbles), and it’s definitely a true federated network.


> > If you want large scale social networks, you need to work with a large scale of data. Since federated open queries aren’t feasible, you need big machines.

> Thats just simply not true.

> [snip]

> This is why ActivityPub has a discovery problem by the way, but it's just a symptom of real federation.

You're actually agreeing with them! The "discovery problem" is because "federated open queries aren't feasible".

> With ATProto you NEED to connect to some centralized relay that will broker your messages for you.

You can connect to PDSs directly to fetch data if you want; this is exactly what the relays do!

If you want to build a client that behaves more like ActivityPub instances, and does not depend on a relay, you could do so:

- Run your own PDS locally, hosting your repo.

- Your client reads your repo via your PDS to see the accounts you follow.

- Your client looks up the PDSs of those accounts (which are listed in their DID documents).

- Your client connects to those PDSs, fetches data from them, builds a feed locally, and displays it to you.

This is approximately a pull-based version of ActivityPub. It would have the same scaling properties as ActivityPub (in fact better, as you only fetch what you need, rather than being pushed whatever the origins think you need). It would also suffer from the same discovery problem as ActivityPub (you only see what the accounts you follow post).
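A rough sketch of the per-account fetch step with plain curl (the DID is a placeholder; plc.directory resolves the DID document, and com.atproto.repo.listRecords is the standard repo-read endpoint):

  DID=did:plc:example123   # a followed account's DID (placeholder)
  PDS=$(curl -s https://plc.directory/$DID | jq -r '.service[0].serviceEndpoint')
  curl -s "$PDS/xrpc/com.atproto.repo.listRecords?repo=$DID&collection=app.bsky.feed.post&limit=5" \
    | jq -r '.records[].value.text'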

At that point, you would not be consuming any of the _output_ of a relay. You would still want relays to connect to your PDS to pull data into their _input_ in order for other users to see your posts, but that's because those users have chosen to get their data via a relay (to get around the discovery problem). Other users could instead use the same code you're using, and themselves fetch data directly from your PDS without a relay, if they wanted to suffer from the discovery problem in exchange for not depending on a relay.


It doesn't change the fact that if someone were to do that, it wouldn't be supported by anyone, let alone the main Bluesky firehose. I think it's pretty disingenuous to just say "you can do it" when what you're suggesting is so far off the intended usage of the protocol that it might as well be a brand-new implementation. As a matter of fact, people DO already do this: they use ActivityPub and talk to Bluesky using a bridge.

The core of the issue is that Bluesky's current model is unsustainable. The cost of running the main relay is going to keep rising, and the barrier to discovery keeps getting higher and higher. It might cost $150/month now to mirror the relay, but what's going to happen when it's $1000?


Bluesky supposedly has 5.5 million active users.

https://en.wikipedia.org/wiki/Bluesky

By your own numbers it averages 2.7 MB/s (232 GB/day ÷ 86,400 s/day). This is manageable with a good cable internet connection or a small VPS. That's a small number for 5.5 million active users.

What happens if it expands to 10 times its current active users? Who knows, maybe only the 354 million people around the world with access to gigabit broadband can run a full server at home and the rest of the people and companies that want to run full servers will have to rent a vps.

https://www.telecompetitor.com/gigabit-availability-report-1...

The point here is that this is not a practical problem. How many of these servers do you really need?


AP also has relays as a part of the architecture, just less well documented https://joinfediverse.wiki/index.php?mobileaction=toggle_vie...

They may share a name, but they both work in very different ways.

Based on the article, OP runs his Jetstream instance with 12 consumers (subsets of the full stream, if I understand correctly) on a $5 VPS on OVH.

That's only about 2.7 MB/s on average.

If someone wants to run a server, they probably would pay for a VPS with a gigabit connection, which would be able to do 120 MB/s.

You might need to pay for extra bandwidth, but it is probably less than a night out every month.


Sometimes I think all of this could have been avoided if people knew how to use RSS.

I thought this was going to be about NATS Jetstream, but it is not.

https://docs.nats.io/nats-concepts/jetstream


I thought this was about BlueSky using NATS!

Why is this being downvoted? Seems like a valid concern to raise if you find two pieces of software somewhat having the same functionality.

It has the same name, not the same functionality. I am not a downvoter... but it's probably because reading a few sentences of this blog post would reveal what it is.

Perhaps the complaint is more about project namers spending zero time checking for uniqueness.

A problem shared with AT as well.

Was I being downvoted?

It wasn't even a criticism, just an observation for anyone else who was thinking the same or was interested in another popular project with a similar name (and seemingly similar functions? didn't look too hard).

Naming things is hard and we all kinda share one global tech namespace, so this is gonna inevitably happen.


I thought this was going to be a strange read about the Hayes command set[1] at first glance.

1: https://en.m.wikipedia.org/wiki/Hayes_AT_command_set


My guess was NATS Jetstream [0].

[0] https://docs.nats.io/nats-concepts/jetstream


I thought exactly the same

I'm never not going to look for Hayes command set topics when people talk about BlueSky.

You're not the only one. I don't get why they couldn't have named it something that wasn't very similar to something already around for several decades, or at least insist on the shortened ATproto name (one word, lower case p). Sure, in practice, no one will actually confuse them, but that could be said for Java and JavaScript.

> Before this new surge in activity, the firehose would produce around 24 GB/day of traffic.

The firehose is all public data going into the network right?

Isn't that pretty tiny for a worldwide social network?

And the fact one country can cause a 10x surge in traffic also suggests its worldwide footprint must be tiny...


The 10x surge in traffic was us gaining 3.5M new users over the course of a week (growing the entire userbase by >33%) and all these users have been incredibly active on a daily basis.

Lots of these numbers are public and the impact of the surge can be seen here: https://bskycharts.edavis.dev/static/dynazoom.html?plugin_na...

Note the graphs in that link only show users that take a publicly visible action (i.e. post, like, follow, etc.) and won't show lurkers at all.


24GB/day is very much a "we could host this on one server with a bit of read caching" scale.

Every year or so, you plop in one more 10TB SSD.


Yes the actual record content on the network isn't huge at the moment but the firehose doesn't include blobs (images and videos) which take up significantly more space. Either way, yeah it's pretty lightweight. Total number of records on the network is around 2.5Bn in the ~1.5 years Bluesky has been around.

I aggregated some stats when we hit 10M users here - https://bsky.app/profile/did:plc:q6gjnaw2blty4crticxkmujt/po...


> The firehose is all public data going into the network right?

It's the "main subset" of the public data, being "events on the network": for the Bluesky app that's posts, reposts, replies, likes, etc. Most of those records are just metadata (e.g. a repost record references the post being reposted, rather than embedding it). Post / reply records include the post text (limited to 300 graphemes).

In particular, the firehose traffic does _not_ include images or videos; it only includes references to "blob"s. Clients separately fetch those blobs from the PDSs (or from a CDN caching the data).
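Concretely, a record only carries the blob's CID; the actual bytes come from a separate request along these lines (the PDS host, DID, and CID are placeholders):

  curl -s "$PDS/xrpc/com.atproto.sync.getBlob?did=$DID&cid=$CID" -o image.jpg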


Given that “AT Protocol” already has a definition in IT that’s as old as OP’s grandma, what is this AT Protocol they are talking about here?

Introduce your jargon before expositing, please.


I wondered something similar when I clicked the link: "who is still using enough AT commands that a compressed representation would matter, and how would you even DO that?" But this is clearly something else.

Anyone who writes software that uses GSM modems for example. Like in embedded systems.

Oh, for sure - I've done some of that myself - but I would never associate the word "firehose" with such low-powered systems!

Iridium satellite modems too.

I’m shocked those things are still up there. That project tried to fail so many times.

"those things" meaning the all-new Iridium constellation, launched in the late 2010s?

https://en.wikipedia.org/wiki/Iridium_satellite_constellatio...


And then I had to look up CBOR too, which at least is a thing Wikipedia has heard of. I mostly use compressed wire protocols and ignore the flavor of the month binary representations.

The article kind of assumes you know what it is in order to be interested in it, but it's the protocol used by Bluesky instead of ActivityPub.

Which makes it a bad submission for HN. If you want exposure, prepare for it.

you sound like you're a blast to be around

You mean the Hayes AT command set? We didn't call it a protocol back in the day.

But it still is a protocol.

BlueSky is mentioned. The ATProto thing has been discussed often in relation to an opensource social media protocol and all of this started from Jack Dorsey and Twitter. If it helps, it's thoroughly documented on Wikipedia.

https://en.wikipedia.org/wiki/Bluesky

And the protocol is documented here:

https://atproto.com/

Hope this helps with the confusion.


The Wikipedia article does a pretty good job at giving an overview for it.

https://en.wikipedia.org/wiki/AT_Protocol


Had same confusion. Wondered why it would need compression...

Was expecting Nats Jetstream but this is also cool

Was expecting the Hayes modem command language.

[flagged]


For real. It's 2024, which in my millennial mind is squarely 'the future', and I'm writing bespoke UART drivers and command parsers for a protocol that's older than I am.

[flagged]


...if you're doing this on a 5 dollar OVH VPS as a solo developer where you don't control all pieces of the puzzle?

[flagged]


This is top-to-bottom 100% nonsense composed out of imagined facts. Please do not contribute this.

State what's not factual instead of just saying it's not factual. Otherwise I can easily say your statement is 100% top-down, in-and-out, all-around, unsubstantiated.


Meanwhile back in reality: Jack Dorsey being involved from the beginning

So what? Others simply don't attach the same importance to this that you do. Last time I looked he was all in on Nostr, which doesn't fit with your theory at all.

Where does Gaddafi come into this? That seems like a complete non-sequitur with the previous sentence talking about people involved in Start-ups in the early 2010s.


Gaddafi was overthrown as part of the Arab Spring. He wasn’t a guy involved in tech startups in the early 2010s hence my confusion as to how he was related to Aaron Swartz.

Ah you misread, I'm saying anyone reading the comment who was active in startups during that time would know about those events.

I'm just popping in here to say that this, and Bluesky/atproto as a whole, are two of the coolest technical feats in today's tech world.

Server-Sent Events (SSE) with standard gzip compression could be a simpler solution -- or maybe I'm missing something about the websocket + zstd approach.

SSE benefits: standard HTTP protocol, built-in gzip compression, simpler client implementation.


Well-configured zstd can save a lot of bandwidth over gzip at this scale without major performance impact, especially with the custom dictionary. Initialising zstd with a custom dictionary also isn't very difficult for the client side.
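For a sense of how little ceremony the dictionary workflow involves, the plain zstd CLI can do all of it (file names here are made up):

  zstd --train samples/*.json -o jetstream.dict    # build a dictionary from captured sample events
  zstd -D jetstream.dict event.json -o event.zst   # compress one message with the shared dictionary
  zstd -d -D jetstream.dict event.zst -o out.json  # decompress using the same dictionary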

As for application development, I think websocket APIs are generally exposed much better and are much easier to use than SSE. I agree that SSE is the more appropriate technology here, but it's used so little that the tooling isn't good. Just about every language has a dedicated websocket client library, but SSE is usually implemented as a weird side effect of an HTTP connection you need to keep alive manually.

The stored ZSTD objects make sense, as you only need to compress once rather than compress for every stream (as the author details). It also helps store the data collected more efficiently on the server side if that's what you want to do.


I don't have an in-depth understanding of SSE, but one of the points the post argues for is compressing once (using a zstd dictionary) and sending that same compressed payload to every client.

The dictionary allows for better compression without needing a large amount of data, and sending every client the same compressed binary data saves a lot of CPU time. Streaming compression usually requires running the compressor separately for each client.


I find it baffling that the difference in cost of serving 41GB/day vs 232GB/day is worth spending any dev time on. We're talking about a whopping 21.4Mbps on average, which costs me roughly CAD$3.76/month in transit (and my transit costs are about to be cut in half for 2 x 10Gbps links thanks to contracts being up and the market being very competitive). 1 hour of dev time is upwards of 2 years of bandwidth usage at that rate.

A thing I have admired about the BlueSky development team is that they're always thinking about the future, not just the present. One area in which this is true is the system design: they explicitly considered how they would scale. Sure, at the current cost, this may not be worth it, but as BlueSky continues to grow, work like this will be more and more important.

Also, it's just a good look towards the community. Remember, ideally not all of ATProto's infrastructure is run by Bluesky themselves, but by a diverse set of folks who want to be able to control their own data. These are more likely to be individuals or small groups, not capitalized startups. While the protocol itself is designed to make this possible, "what happens when Bluesky is huge and running your own infra is too expensive" is a question that some folks are reasonably skeptical about: it's no good being a federated protocol if nobody can afford to run a node. By doing stuff like this, the Bluesky team is signaling that they're aware of and responsive to these concerns, so don't look at it as trying to save money on bandwidth: look at it as a savvy way to generate some good marketing.

That's my take anyway.

EDIT: scrolling down, you can see this kind of sentiment here: https://news.ycombinator.com/item?id=41637307


The current relay firehose has more than 250 subscribers. It's served more than 8.5Gbps in real-world peak traffic sustained for ~12 hours a day. That being said, Jetstream is a lot more friendly for devs to get started with consuming than the full protocol firehose, and helps grow the ecosystem of cool projects people build on the open network.

Also, this was a fun thing I built mostly in my free time :)


Also, it's not just those concrete 190 GB a day. It's the 6x traffic you can fit in the same pipe :D

Yeah exactly! The longer we can make it without having to shard the firehose, the better. It's a lot less complex to consume as a single stream.

Also, since it's a streaming firehose with the same data for everyone (if I understood it correctly from a quick read), I wonder if it would make sense to do some kind of P2P/distributed transfer of it to ease the load...

It's a benefit on the receiving side too. And it has ecological benefits. Nothing to sneeze at.


