Streaming data to clients at the edge (apps or services) is a hard problem. We built an approach that keeps Postgres at the center and allows clients to consume a stream of data that is already in Postgres without any additional moving parts.
This works with read-replica-style scaling or with Postgres flavours that support scaling out easily (e.g. Cosmos Postgres and Yugabyte, with CockroachDB coming soon).
I think there’s a huge unacknowledged gap in the database industry for good streaming products.
If we are building a report or dashboard that we pull up a few times a day, then a pull-based model where we query the database on page load is fine.
For almost anything else, such as an app, a microservice, an alerting system, a web page, or a dashboard, we want to be able to update it in near real time for the user experience. Receiving a stream of query results is by far the easiest way to do this.
Polling is obviously a poor interim solution.
I think streaming will be a huge story in data over the next decade. The products are coming through now which is a start.
Materialize gives a vaguely PostgreSQL-ish userland on top of this idea. Obviously Postgres has LISTEN/NOTIFY, which is fine for most places you need a queue and push updates, but the key is incremental view updates, so you get instant results on complex queries.
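For reference, the LISTEN/NOTIFY mechanism mentioned above is just two statements; a minimal sketch (the channel name and payload here are made up):

```sql
-- Session A: subscribe to a named channel
LISTEN order_events;

-- Session B: publish a payload to that channel,
-- delivered to listeners after the transaction commits
NOTIFY order_events, 'order 42 updated';
-- equivalently, callable from a trigger function:
SELECT pg_notify('order_events', 'order 42 updated');
```

This pushes small notification payloads, not query results, which is why incremental view maintenance is the harder and more valuable part.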
I'm generally pretty happy to pay for open source software, but licensing like this is just too risky. I need to be able to experiment with something at scale, in production, before I start paying someone.
Free for 1 node, paid after. Code reverts to a permissive license after 4 years. I think this is fair: you can bootstrap for free and only pay if it works out.
I wouldn't quite say that this need is unacknowledged: most DBs have some kind of streaming interface for actual data contents (not queries, though), which is leveraged for instance by Debezium to expose change event streams in one uniform event format (disclaimer: I work on Debezium). Going forward, I expect these capabilities to be substantially built out, also providing streaming query support (incrementally updated materialized views), not locked into separate licensing layers or a more-or-less-complex-to-use afterthought, but a primary part of database offerings, just as SELECT is today.
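The change-stream interface Debezium builds on is Postgres logical decoding, which you can poke at directly with the built-in `test_decoding` plugin; a minimal sketch, assuming `wal_level = logical` and a hypothetical `accounts` table:

```sql
-- Create a logical replication slot that decodes WAL into readable change events
SELECT pg_create_logical_replication_slot('demo_slot', 'test_decoding');

-- Make a change...
INSERT INTO accounts (id, balance) VALUES (1, 100);

-- ...then consume it from the slot as a change event
SELECT * FROM pg_logical_slot_get_changes('demo_slot', NULL, NULL);
```

Tools like Debezium do the same thing over the replication protocol with a structured output plugin instead of ad-hoc polling.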
Streaming has been the story for the past couple of decades already. But once you accept the value of an event-first architecture, the benefits of traditional RDBMSes become much weaker.
It’s been discussed for a long time, but the ecosystem is still spotty. Many databases have some kind of “live query” feature for instance but with limitations or not intended for production use.
A lot of work has taken place in the Kafka, Flink, DataFlow ecosystem but that still leaves a lot of work for the developer over a simple subscribe to query results.
I do think a lot of work has been done, but it all needs to move up a few levels of abstraction.
> A lot of work has taken place in the Kafka, Flink, DataFlow ecosystem but that still leaves a lot of work for the developer over a simple subscribe to query results.
Personally I find it much easier to write the code explicitly than try to understand what a query planner is doing, especially if performance is relevant. (That's not to say there's not plenty of room for improvement in the streaming world - but I'd rather have a helper library that I can use on top of the low-level API, than have to go through a parser and planner for every query even when I know exactly what I want to do) But I seem to be an outlier in this regard.
Our focus (Hasura) is on the last mile so that innovations on the data side (eg: materialize, ksql, timescale continuous aggregates) are “just obvious” to start building applications and services against.
maybe if you're a really big company with petabytes of data, but I wouldn't say it's worth the operational complexity for 90% of tech companies out there. Seriously, design your schema so you can have tables optimized for aggregate functions. You can use materialized views or have them populated via triggers. Either is still going to be an order of magnitude less work than dealing with all the edge cases that are going to come up when you try to build a distributed data pipeline. That's death by a thousand cuts for any small to medium startup that doesn't already have a team of engineers experienced in making such systems work.
Plus you can stream db changes from postgres to kafka for those edge cases where you really need it
TL;DR: if you're in a startup and thinking of building a distributed system: DON'T. Stick with a monolith and spin out services as NEEDED.
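The trigger-populated aggregate table suggested above can be sketched like this (table and column names are hypothetical):

```sql
-- Rollup table kept fresh by a trigger, so dashboards read one indexed row per day
CREATE TABLE daily_revenue (
  day   date PRIMARY KEY,
  total numeric NOT NULL DEFAULT 0
);

CREATE FUNCTION bump_daily_revenue() RETURNS trigger AS $$
BEGIN
  INSERT INTO daily_revenue (day, total)
  VALUES (NEW.created_at::date, NEW.amount)
  ON CONFLICT (day) DO UPDATE
    SET total = daily_revenue.total + EXCLUDED.total;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_rollup
  AFTER INSERT ON orders
  FOR EACH ROW EXECUTE FUNCTION bump_daily_revenue();
```

The trade-off versus a materialized view is freshness per write versus a periodic `REFRESH MATERIALIZED VIEW`.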
> maybe if you're a really big company with petabytes of data, but I wouldn't say it's worth the operational complexity for 90% of tech companies out there. Seriously, design your schema so you can have tables optimized for aggregate functions. You can use materialized views or have them populated via triggers. Either is still going to be an order of magnitude less work than dealing with all the edge cases that are going to come up when you try to build a distributed data pipeline. That's death by a thousand cuts for any small to medium startup that doesn't already have a team of engineers experienced in making such systems work.
Funny, my experience is the exact opposite. Materialized views and triggers are death by a thousand cuts with all their edge cases. Whereas if you just use Kafka for everything from day 1 then everything works consistently and you have no problems.
So who's managing the Kafka cluster? Are you running scheduled backups of it? What happens if the system is down while ingesting data? I can only imagine you have someone that's already very experienced in Kafka if you find that easier to manage than an RDBMS, which already has established hosted services available.
Running Kafka is fiddly but a lot easier than running a true master-master RDBMS setup (which none of my employers have ever managed; at best some have had a read replica and an untested shell script from five years ago that supposedly promotes that read replica to master if the master goes down). Backups are the same problem either way (and, while it's not a full replacement for actual backups, the fact that Kafka is clustered means you're protected from some failure scenarios out of the box in a way that you aren't with something like Postgres). And there are plenty of established hosted kafka services if that's your preferred approach.
So the question is: Why have two copies of your data, two products to learn and monitor and operate, write boilerplate to move data between the DBs, etc.?
A "message queue" comes down to being another index on your table/set of tables ordered by a post-commit sequence number. These are things all SQL DBs have already, it just lacks a bit of exposing/packaging to be as convenient to use as a messaging queue.
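That "queue as an index" idea is commonly implemented in plain Postgres with `FOR UPDATE SKIP LOCKED`; a minimal sketch (the `jobs` table is hypothetical, and a serial ID only approximates post-commit order):

```sql
-- A plain table acting as a queue: the sequence column is the "queue index"
CREATE TABLE jobs (
  id      bigserial PRIMARY KEY,
  payload jsonb NOT NULL,
  done    boolean NOT NULL DEFAULT false
);

-- Each worker atomically claims the next unprocessed row;
-- SKIP LOCKED lets concurrent workers pass over rows already claimed
WITH next AS (
  SELECT id FROM jobs
  WHERE NOT done
  ORDER BY id
  FOR UPDATE SKIP LOCKED
  LIMIT 1
)
UPDATE jobs SET done = true
FROM next
WHERE jobs.id = next.id
RETURNING jobs.id, jobs.payload;
```

This gives at-least-once delivery semantics inside the same transactional store as the rest of your data.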
While I love the ingenuity of the developers at Hasura (because I've personally been through these scaling challenges), I always get a gag reaction to GraphQL. I've honestly tried hard to digest it, and I can tell you that at large scale, where a single DB won't cut it, you would either need to develop a large federation layer like [Netflix](https://netflixtechblog.com/how-netflix-scales-its-api-with-...), or just rip it out. Streaming might alleviate the problem a little, but the real problem still lurks under the hood. The fact that the front-end community wants to fit everything under GraphQL bothers me, because every backend developer knows that a single tool/technology is usually not the best tool for solving all the problems in your life. Remember the golden words: THERE IS NO SILVER BULLET!
If database capacity is your main concern and not independent schema deployability, then federation is overkill. You can just connect to whichever databases contain your data in your resolvers within a single service.
You have to be at pretty massive scale before federation becomes necessary and by then (if ever) your frontend teams have experienced benefits that are pretty much miraculous. The reason frontend wants to fit as much as possible into it is because it's vastly better than what came before it unless you have a 95th percentile org that is really doing an outstanding job managing the API via other means.
I think graphql with a single db is an odd thing to gain popularity - as the raison d'être for graphql was to federate and proxy different, heterogeneous data sources (databases, json/rest services, soap, other graphql services...).
Speaking of Netflix: I think they had an alternative API federation service that used some clever tricks with JSON string vs. number keys to allow for alternating HTTP PUT/GET, and through that leveraged HTTP-level caching. But I can't find the link...
GraphQL raison d’être was not to federate and proxy different, heterogeneous data sources. It has initially been developed as a better API for the Facebook monolith. I asked one of GraphQL’s coauthors directly: https://twitter.com/ngrilly/status/1317415232717860866?s=46&....
Interesting - I didn't even know that there was a monolith at Facebook [ed: beyond there being reports that they use a large monorepo for code]. I stand corrected, thank you.
This doesn't directly address if graphql at Facebook federated different data sources (eg graph and sql databases) - but I suppose not?
I don't have any "insider" information regarding the data sources used by the Facebook monolith, written in PHP, when GraphQL has been developed, but I would expect it to talk to multiple/different datastores, and not just MySQL. Then yes, it could be argued that it has been used to federate multiple data stores, even though it was already the case before using GraphQL :)
The main thing I see it being used for is field selection and model nesting. From an end-user perspective, it's pretty nice for this, I think. It's certainly nice that if I know it's a GraphQL endpoint, I've already got a fairly solid idea of how to query it.
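Field selection and model nesting look roughly like this in practice (schema and field names here are hypothetical):

```graphql
# Ask for exactly the fields you need, nested across related models,
# in a single round trip
query {
  user(id: 42) {
    name
    orders(limit: 5) {
      id
      total
      items {
        sku
        quantity
      }
    }
  }
}
```

The same endpoint serves every such shape, which is what makes a GraphQL API feel predictable once you've used one.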
FD: I work at hasura and work with users/customers.
Fair point about the DB scaling but not sure if everyone is going to run into this issue. Also, lots of solutions are emerging for this specific problem (with different trade-offs of course) like distributed databases (crunchy, YugaByte, Spanner, etc.). Most folks I work with get by with a reasonably sized DB and some read replicas.
In my experience, if you have more than a small complexity to your data, and you're a medium scale business (or are going in that direction), you're going to end up with this issue. And then it becomes really painful to get out of the situation as migrating data and splitting things up is a real PITA.
While you are correct that not everyone is going to end up with this issue, those that are thinking of growing into a medium-sized business should be working to avoid it, which unfortunately means solutions such as Hasura lose value. It would be good to see more ability to collect data from multiple sources (please reply and correct me if you already do this; I'm not super familiar with the service).
With Hasura you can create a federation of databases and expose them as graphql or rest endpoints. You can also wrap your existing REST/Graphql services. Almost like API gateway, place where you unify your services, manage access / row-level permissions etc.
I've been thinking about this problem, and having tried many different alternatives, the more I think about it, the more I believe it's best to send plain SQL for queries.
The problem there is: who has the right to run which queries?
It was never the solution to begin with, IMHO.
It's lacking a permissions or let's say authz framework.
It's only half the solution. And it's too complicated, and there's no real standard, as in a real-world standard. Everyone has their own little soup cooking, because the manifest is incomplete and always evolving.
I've been using Hasura in production for small commercial projects deployed on AWS and I've been positively impressed by the stability and the speed up - it makes spinning up a graphQL backend with row-level security straightforward.
Just curious: my understanding is Hasura kind of discourages using RLS in lieu of their own access control layer[1]. Did you consider pros/cons of either approach?
People do use RLS for app user-level access control. It has been advocated by PostgREST for years [1] and is also widely practiced in the PostGraphile community [2]. Hasura distinguishes itself by not actively advocating db-level access control. From what I recall, the first versions of Hasura were created at a time when Postgres' RLS was still very slow, but it's not clear to me why they still push their own solution. (Not saying there can't be a good reason for it; it's just not clear to me what it is, and I'd rather default to using functionality built into the database.)
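For context, db-level RLS is only a couple of statements; a minimal sketch, assuming a hypothetical `chat_messages` table and an app that sets a session variable per connection:

```sql
-- Enable row-level security on the table
ALTER TABLE chat_messages ENABLE ROW LEVEL SECURITY;

-- Each app user sees only their own rows; the app is assumed to run
-- SET LOCAL app.current_user_id = '<id>' at the start of each transaction
CREATE POLICY own_messages ON chat_messages
  USING (user_id = current_setting('app.current_user_id')::bigint);
```

Hasura's layer instead compiles equivalent predicates into the generated SQL, which is what lets it multiplex many subscribers into one query.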
But for column level permissions views are still needed, right? This isn't the case with hasura. (Of course hasura's approach has own limitations).
> Anyone accessing the generated API endpoint for the chat table will see exactly the rows they should, without our needing custom imperative server-side coding.
That looks like a bad joke. It is still imperative, just now in the database. I'm not sure that's better. Hasura's approach is declarative.
From my point of view, postgres' security model is still not suitable for users. It is more complex, imperative, and therefore more error prone. Hasura's approach is still not perfect, but a combination of postgres and hasura functionality make a huge difference.
Seeing some feedback on GraphQL: Hasura has had support for converting templated GraphQL into RESTish endpoints (with OpenAPI Spec docs if needed). We are planning to do the same for this streaming API as well. Does anyone have good examples of existing REST/RESTish endpoints that do something similar?
We’ve been using Hasura at work, but we’ve stopped using it for everything other than subscriptions in favour of hand-written REST APIs. The problem for us wasn’t really GraphQL itself, but the fact that the client app was determining the query. If the client could request a “named query” that was then determined by the backend (perhaps via a webhook?), then we’d have been able to use Hasura more.
I’ve been a big Hasura user for a while. Give their RESTified endpoints a go: they solve this issue for you and still give you all the access control goodness and subscriptions under one roof.
We have written a post[1] on building a real-time chat app with Streaming Subscriptions on Postgres. It gives a quick overview of the architecture used and how you can leverage the API on the client side with AuthZ. There’s a live demo that you all can try out.[2]
Has anyone come across neat tools for load-testing streaming APIs?
We used https://github.com/hasura/graphql-bench and a set of scripts to monitor runtime characteristics of Hasura and Postgres, and reconciliation to make sure data was received as expected and in-order.
But would love to see if there's other tools that folks have come across!
Disclaimer - I work on Hasura, so I won't comment too much on which is better.
They're all similar flavors of producing realtime results - which take similar, but different, methods to their approach.
My understanding (please feel free to correct me if I'm wrong):
- Supabase Realtime uses WAL.
- Hasura Streaming Subscriptions uses a query which will be append-only (could be a sort-by or also WAL).
- Hasura Live Queries uses interval polling, refetching, and multiplexing.
- Supabase uses Postgres RLS for authorization, while Hasura uses an internal RLS system which composes queries (which allows for features like the multiplexing above).
- All 3 use websockets for their client communication.
Haven’t looked at the internals or the API, but I think for starters, Hasura is designed to work on any Postgres, including ones with existing data, especially modern Postgres offerings (Crunchy, AlloyDB, Neon, Aurora, Cosmos Postgres, Timescale, Yugabyte).
I'm a (very satisfied) hasura user and trying to figure out which realtime interface to pick for a chat component. Can you explain the difference between the two and how/when to pick the right one?
You would benefit from using a streaming subscription. If you use a live query, you have to manually maintain a "cursor" (ie last message you fetched) and update your graphql subscription accordingly every time a new message is received or sent. If you don't you'll be refetching the same messages repeatedly, and they'll just grow as more messages are added.
With a streaming subscription this is all taken care of, you simply do something like this:
1) Load the last 50 messages (or whatever number of existing messages suits you)
2) Run your streaming subscription using the most recent of the messages from above as your cursor
3) When you send a message, do nothing and let the subscription handle getting the message you just sent
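The subscription in step 2 looks roughly like this with Hasura's streaming API (table, field, and cursor values here are hypothetical):

```graphql
# Stream messages newer than the last one loaded in step 1;
# Hasura advances the cursor for you as batches arrive
subscription NewMessages {
  messages_stream(
    cursor: { initial_value: { id: 50 }, ordering: ASC }
    batch_size: 10
  ) {
    id
    author
    body
  }
}
```

The client just appends each incoming batch to its local list; no manual cursor bookkeeping is needed.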
It sounds like the stream must be an append-only table. This is awkward - I would expect streaming updated results for a query. If I want clients to refetch changed record in real time, do I still need to build that on top of this stream primitive? Like, I stream an audit log style table, and then refetch any IDs mentioned in the stream separately?
I work at Hasura. You don't always need an append-only table. You may have to add a monotonic ID to your table which increments on DMLs (and use it as the cursor for streaming), and this can be done using a trigger in Postgres. Also, streaming works over tables with relationships, so you can get related data in the same stream as well.
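A minimal sketch of such a monotonic cursor column, assuming a hypothetical `items` table (this is one way to do it, not Hasura's prescribed setup):

```sql
-- A shared sequence provides a strictly increasing version number
CREATE SEQUENCE row_version_seq;

-- New rows get a version on INSERT via the column default
ALTER TABLE items
  ADD COLUMN row_version bigint NOT NULL DEFAULT nextval('row_version_seq');

-- Updates bump the version too, so updated rows reappear in the stream
CREATE FUNCTION bump_row_version() RETURNS trigger AS $$
BEGIN
  NEW.row_version := nextval('row_version_seq');
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER items_bump
  BEFORE UPDATE ON items
  FOR EACH ROW EXECUTE FUNCTION bump_row_version();
```

Streaming on `row_version` then behaves like an append-only log even though rows are updated in place.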
Can you go into more detail about what you mean by "duplication of data"? I worked on a team whose product uses Hasura; my understanding is that data still comes directly from Postgres, even if Hasura is sitting in the middle. Where is the data being duplicated?
Nope, they're saying that products like hasura resolve the issue with the duplication of data. Or to put in another way, you don't need multiple stores like kafka with redundant data storage. Just use hasura.