Poor [Cloud Tasks](https://cloud.google.com/tasks), it's the actual GCP message queue, but everyone forgets it exists and uses Pub/Sub instead.
What's funny is Pub/Sub is a fundamentally different model from message queues: queues are normally tightly coupled, meaning you enqueue a message to be processed by a specific system. Pub/Sub on the other hand is fundamentally loosely coupled: the publisher doesn't care who (if anyone) gets the message.
(I've heard "orchestration" vs "choreography" to describe this dichotomy, but I can't say I'm a fan of the jargon.)
Pub/Sub is meant to follow the Kafka "insane amounts of data" firehose pattern. Everyone thinks they need that scale unfortunately and skips over Cloud Tasks.
The Kafka pattern isn't about the size of your data, it's about accommodating heterogeneous consumers and easy fan-out. I don't love the AMQP exchange model because it adds state to the middle of the queue.
Hey, author here, feels really nice to be HN famous (as defined by your stuff being on front page without you posting it). AMA and feedback welcome I guess.
Wanted to write a book on this topic over the pandemic lockdown, but then I felt it could just be a post. Let me know if anyone thinks it’s worth a full book. Maybe more exploration and demos of the options, reference designs etc.
Really useful info at just the right level for me, and coincidentally at just the right time! Thanks!
One aspect I'm interested in is what happens when senders and receivers are mismatched in throughput, with the worst case being a too-fast producer causing queues to get "backed up". Do most of the tools you described have decent diagnostics/monitoring for this? Are there design-stage approaches to avoid this completely, even if demand varies (predictably or unpredictably)? Any light you could shed here, or in the post, would be appreciated.
Depends on the queue... I've seen SQS queues holding millions of messages, and think I've heard of a billion plus somewhere - not sure other systems handle that kind of pressure.
In terms of monitoring, yeah, most systems will offer that... On AWS for example we set up CloudWatch alarms that post in our team chat / email when the number of messages waiting in certain queues reaches N, and we go see what's up. AWS also has autoscaling options with CloudWatch alarms, so some consumer services are configured to automatically add more capacity when there are more than N messages in the queue.
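For the curious, a queue-depth alarm like that is just a handful of parameters. This is a sketch of the kwargs you'd pass to boto3's `cloudwatch.put_metric_alarm(...)` — the queue name, alarm name, and threshold are all placeholders, not anything from the comment above:

```python
# Sketch: a CloudWatch alarm that fires when an SQS queue backs up.
# Names and numbers are illustrative; the kwargs mirror boto3's
# cloudwatch.put_metric_alarm(**alarm_kwargs) call.
alarm_kwargs = {
    "AlarmName": "orders-queue-backlog",                       # hypothetical name
    "Namespace": "AWS/SQS",
    "MetricName": "ApproximateNumberOfMessagesVisible",
    "Dimensions": [{"Name": "QueueName", "Value": "orders"}],  # hypothetical queue
    "Statistic": "Maximum",
    "Period": 60,               # sample every minute
    "EvaluationPeriods": 5,     # must be breached 5 periods in a row
    "Threshold": 1000,          # the "N" from the comment above
    "ComparisonOperator": "GreaterThanThreshold",
    # "AlarmActions" would point at an SNS topic that posts to chat / email.
}
```

With credentials configured you'd fire it off as `boto3.client("cloudwatch").put_metric_alarm(**alarm_kwargs)`; the same metric drives the autoscaling policies.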
This is known as backpressure in the biz. Your queues eventually have to start returning errors as they get backed up, and those errors need to be reflected in the producers and ultimately fed up to the end user for feedback. No big deal.
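The simplest demonstration of backpressure is a bounded queue: once it's full, the producer gets an error instead of the backlog silently growing. A minimal in-process sketch:

```python
import queue

# A bounded queue is the crudest form of backpressure: once full,
# the producer gets an error instead of a silently growing backlog.
work = queue.Queue(maxsize=2)

def try_enqueue(msg):
    """Return True if accepted, False if the queue pushed back."""
    try:
        work.put_nowait(msg)
        return True
    except queue.Full:
        # In a real system this is where the error would be surfaced
        # to the producer and ultimately the end user ("try again later").
        return False

accepted = [try_enqueue(m) for m in ("a", "b", "c")]
# The third message is rejected: the queue is full.
```

Brokered queues behave analogously, just with the error arriving as a publish failure or a blocked connection instead of an exception.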
If you want to learn about message queues firsthand (and you should if you care about CS) I recommend implementing some with ZeroMQ. (I love the joke in the article about it being "zero effort" xD) There are many, many forms of queues, and you can implement most of them with ZeroMQ. There are tons of examples in many different languages, and lots of guides and docs to help you. Even considering all that, it will still be a pain to get the queues working, and that pain will help reinforce the purpose behind different queue architectures, and why it's so nice to have COTS message queues.
That said, if you want a customizable FOSS message queue stack, use NATS. I'm not aware of a better, more complete solution designed for custom integration.
I never used ZeroMQ (visited their webpage many many times though) but I totally agree that going through the pain, struggles and subtleties of message queuing is a very valuable and important learning experience.
I had those experiences with RabbitMQ around 10 years ago when I pushed the team I worked for towards a message queue solution to scale our distributed test system. It turned out successful but it took quite a while :)
On an intuitive level I love message queues. I like how they loosen the coupling between components and the amount of flexibility they provide. A few years ago I even built quite a large production system with RabbitMQ (a scheduling, monitoring and reporting system for distributed tests).
However, I find it extremely difficult to reason about them. What do I mean by that? I mean all the nitty gritty details like performance, bottlenecks, crashing participants, network issues, queues filling up, protocol issues that lead to message cascades and DDoS, etc.
I hear you. I've owned a backend service using Azure Service Bus for several years, which recently ran into problems when a new deployment happened. It turns out the default settings on the bus cause messages to be retried ten times if they take longer than 30s to process. That makes small issues escalate into big issues very, very quickly.
Yeah, I've learnt these numbers the hard way over many weekend midnights. This particular problem especially leads to exponential explosions - something needs to be pretty overloaded or computationally expensive to take over 30s (or whatever the queue cutoff is), and retrying it adds more of the same load and makes other messages slower, which causes them to be retried again too. Not fun.
> Accidental decoupling is where you have a complex state machine encapsulating a business procedure with multiple steps, and it's coordinated as messages between and actions in multiple services.
That might need emphasis on "in multiple services."
Within the same service, a granular set of messages (events) can still be useful for auditing or creating good read-model "projections" of what happened.
It's true that the messages (state machine transitions) don't have to be a durable source of truth, but there are similar arguments to be made for granularity.
I'll add my 2c: if you ever need to store some metadata alongside the message in a DB - status, some execution history, etc. - then it's better to avoid an MQ entirely. That is, provided you can scale DB access from the workers and can couple producers/consumers via the same DB. But that's the case for almost all applications TBH.
I mostly agree, but from a devil's advocate position, the downside is the likelihood that you end up reimplementing queue basics like retries, delay/scheduling, and of course, the essential transactional state flips without locking or perf issues.
In my experience, the downside to the queue is losing all the historical statistics/state that you get for free with a database table. You have to instrument all that stuff manually, since most simple queues are designed to be transient once messages are confirmed.
I usually end up with a hybrid: store a copy of the state in the DB (along with all the job data), and essentially use the queue to hand off an ID or something pointing to the DB. You can then run queries against the best-effort state recorded in the DB, but the queue handles all the at-most and schedule/retry logic I don't want to handcraft.
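That hybrid is easy to sketch in-process: the database is the source of truth for job state, and the queue only carries an id pointing at it. All names here are illustrative (an in-memory SQLite table and `queue.Queue` standing in for a real DB and broker):

```python
import queue
import sqlite3

# The DB is the source of truth for job state; the queue carries only ids.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, payload TEXT, status TEXT)")
q = queue.Queue()  # stand-in for SQS / RabbitMQ / etc.

def submit(payload):
    cur = db.execute(
        "INSERT INTO jobs (payload, status) VALUES (?, 'pending')", (payload,)
    )
    db.commit()
    q.put(cur.lastrowid)   # hand off only the id to the queue
    return cur.lastrowid

def work_one():
    job_id = q.get()
    (payload,) = db.execute(
        "SELECT payload FROM jobs WHERE id = ?", (job_id,)
    ).fetchone()
    # ... do the actual work with payload here ...
    db.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
    db.commit()

jid = submit("resize image 42")
work_one()
status = db.execute("SELECT status FROM jobs WHERE id = ?", (jid,)).fetchone()[0]
```

The queue keeps its delivery/retry semantics, while every query about "what happened to job X" goes against the best-effort state in the table.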
HTTP is a transport. It doesn't have any properties beyond letting you make a request and hopefully get a response. Semantics are defined by the actual client/server implementations and the corresponding backend.
MQTT is a transport too, but its design facilitates a brokered pub-sub message queue. The transport implementation is effectively the queue, as far as the applications are concerned.
The application doesn't care if it's "just a transport" or "part of an elaborate spec". Functionally there is no difference between an HTTP transaction and a message queue transaction, for a specific kind of message queue. You implement them literally the same way in an application; open socket, connect, send, receive, optional loop, close.
The point is that a message queue is just a concept. It can exist in many forms for many different use cases. You evaluate the type of queue and implement something that provides for its requirements. HTTP does that inherently for certain message queue types, and for others it requires more code.
Another example is the perennial meme of "PostgreSQL queue". A database isn't a message queue, it's a database! Yet people throw some crap into the DB and call it a queue. Because the application doesn't care how you implement the logical concept. HTTP, RDBMS, MQTT broker, whatever.
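For what it's worth, the "PostgreSQL queue" usually boils down to one atomic claim query. This sketch uses SQLite for portability; in PostgreSQL the claim would typically be written with `SELECT ... FOR UPDATE SKIP LOCKED` so concurrent workers never grab the same row. Table and column names are made up:

```python
import sqlite3

# Minimal "database as queue": rows are messages, claiming one is an
# atomic transaction. (Postgres would use FOR UPDATE SKIP LOCKED here.)
db = sqlite3.connect(":memory:", isolation_level=None)  # manage txns by hand
db.execute("CREATE TABLE q (id INTEGER PRIMARY KEY, body TEXT, claimed INTEGER DEFAULT 0)")
db.executemany("INSERT INTO q (body) VALUES (?)", [("first",), ("second",)])

def claim_one():
    """Atomically claim the oldest unclaimed message, or return None."""
    db.execute("BEGIN IMMEDIATE")  # take the write lock up front
    row = db.execute(
        "SELECT id, body FROM q WHERE claimed = 0 ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        db.execute("COMMIT")
        return None
    db.execute("UPDATE q SET claimed = 1 WHERE id = ?", (row[0],))
    db.execute("COMMIT")
    return row[1]

got = [claim_one(), claim_one(), claim_one()]
# Messages come out in order; an empty queue yields None.
```

The application just sees enqueue/dequeue; whether an RDBMS, an MQTT broker, or HTTP sits underneath is an implementation detail, which is exactly the point.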
thanks for noting, 2020 should be in the title, I was about to comment that event bridge (which linked article notes is new) is a rebrand on cloud watch events which have been around since January 2016.
Started using message queues properly back in 2015 with MSMQ (Microsoft Message Queue)
Had an urgent change to make, away from sending data over from one company... to another. It was a big change for a small amount of time. I decided to try MSMQ as it was already available on Windows PCs and it worked an absolute treat! Another feature was added a few months later which created a lot more messages than the others, but MSMQ handled it just fine!
Since then I have been interested in Message Queue technologies. Been a ZeroMQ (and NetMQ) guy since.
I ended up creating a light-weight broker with NetMQ (in C#) to handle my requests, sending data to workers. It made it completely distributed; components can be placed on different machines, etc. I didn't need a dedicated server.
The broker was using push/pull or router/dealer patterns, and handled issues and loss well.
Pub/sub was also supported, but not as a "reliable" pattern. It was for notifications, but could "retry" if a message wasn't received.
I have dabbled in others like RabbitMQ and Kafka. It's just that I don't need to use them. I guess the only exception will be starting a new job that has 'em. I have a light-weight broker that I can install easily.
We've started using message queues to decouple legacy code from new code in order to migrate from one platform to another. I modelled our implementation after AMQP so we could trivially switch to either RabbitMQ, Azure Service Bus or similar down the line.
However one thing that seems missing from the standard is retry policies. Sure you could just abandon the message and rely on the timeout, but I'd prefer an exponential backoff in case some external service is down for an extended period to avoid things bogging down.
Are there some standard ways I've missed, or do folks rely on proprietary extensions or extra services for this?
It's not exponential backoff, but I've done this in RabbitMQ with some queue weirdness.
I have the main queue with a reasonably short timeout (30 seconds). That queue is set up with a dead-letter queue, where failed messages that don't get ACKed get moved.
The dead-letter queue has a TTL of ~5 minutes, and its dead-letter queue is the original queue.
So basically, if a message fails a worker, it gets kicked over to the dead-letter queue, which then moves it back to the main queue after the TTL times out. This does mean a crashing message will retry forever (so you have to keep a careful eye on how many messages are in the dead-letter queue), but I've managed to work around this so far. Or you can use proprietary extensions (x-delivery-attempts).
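The whole retry loop lives in the queue declarations. This sketch shows the two sets of arguments you'd pass to pika's `channel.queue_declare(queue=..., arguments=...)` — queue names and timings are illustrative, but `x-dead-letter-exchange`, `x-dead-letter-routing-key`, and `x-message-ttl` are real RabbitMQ queue arguments:

```python
# Two queues that dead-letter into each other, forming the retry loop.
MAIN, RETRY = "work", "work.retry"   # hypothetical queue names

main_queue_args = {
    # Rejected / un-acked messages dead-letter into the retry queue.
    "x-dead-letter-exchange": "",          # "" = default exchange
    "x-dead-letter-routing-key": RETRY,
}

retry_queue_args = {
    # After ~5 minutes the message expires...
    "x-message-ttl": 5 * 60 * 1000,        # milliseconds
    # ...and dead-letters straight back to the main queue.
    "x-dead-letter-exchange": "",
    "x-dead-letter-routing-key": MAIN,
}

# With a live broker connection:
#   channel.queue_declare(queue=MAIN, arguments=main_queue_args)
#   channel.queue_declare(queue=RETRY, arguments=retry_queue_args)
```

No consumer ever reads the retry queue; the TTL expiry plus dead-lettering does the "send it back later" part for you.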
> Are there some standard ways I've missed, or do folks rely on proprietary extensions or extra services for this?
As a hack, you can always have your library run its own retry by doing an atomic ack-and-resend-to-the-future (though you need to have bits for retry count if you want exponential back off). And there's situations where it doesn't work well, if the message handler itself crashes too hard on failure.
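The retry-count bookkeeping for that hack is small. A sketch of the "resend to the future" step, where the message shape, limits, and base delay are all made up for illustration:

```python
import random

# On failure: ack the original, then re-publish a copy with an incremented
# retry count and a computed delay ("resend to the future").
MAX_RETRIES = 5
BASE_DELAY_S = 2.0

def next_attempt(message):
    """Return (delay_seconds, new_message), or None if retries are exhausted."""
    retries = message.get("retries", 0)
    if retries >= MAX_RETRIES:
        return None                        # give up -> dead-letter / alert
    delay = BASE_DELAY_S * (2 ** retries)  # 2s, 4s, 8s, ... exponential backoff
    delay *= 0.5 + random.random() / 2     # jitter to avoid thundering herds
    return delay, {**message, "retries": retries + 1}

# First failure: re-publish with retries=1 and a delay of roughly 1-2s.
result = next_attempt({"body": "charge card", "retries": 0})
```

The delay would be carried as a message attribute (e.g. SQS `DelaySeconds` or a TTL queue) so the broker, not the worker, does the waiting.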
I mean yea I can do a lot myself given it's my implementation, but I was hoping to keep our messaging code fairly generic so it'd be easy to use either RabbitMQ or Service Bus, depending on if customer wanted on-prem or hosted installation for example.
* Databases are not as transaction-safe (in their default setup) as one is led to believe. They are ACID for some non-isolated definition of I. It's popped up a number of times in the last few months on HN and was news to me: https://news.ycombinator.com/item?id=38736904
* Even if an ACID transaction protects you from concurrency issues, it won't protect you (more specifically - your data) from logic issues. Bad code will make your database data wrong, even after you push a fix to the code.
* The transactions can only span one database, which can limit your system design a bit.
Embrace eventual consistency and use a persistent, append-only message queue, and these problems go away. The downside is you'll be accused of chasing fads and doing resume-driven development ;)
Yes they are. Transaction-safe does not mean "protecting against concurrency issues". It just means that a transaction either succeeds or fails, nothing more and nothing less. Then there is isolation which is an orthogonal thing. So let's use the correct wording here.
> it won't protect you (more specifically - your data) from logic issues
Nothing protects against logic issues though.
> Embrace eventual consistency and use a persistent, append-only message queue, and these problems go away
I agree that eventual consistency should be embraced, but it doesn't make the mentioned problems go away. How would a persistent, append-only message queue allow you to solve the C of CAP while retaining A and P? Of course it can't. How does it protect you from logic issues? Of course it can't.
> Transaction-safe does not mean "protecting against concurrency issues".
That is the I in ACID.
> It just means that a transaction either succeeds or fails, nothing more and nothing less.
That is the A in ACID.
If you can say isolation is orthogonal, I can say atomicity is orthogonal.
In any case, I was refuting the idea that transactions in a standard ACID database would protect you from, e.g., two message queue workers picking up and processing the same message when they are not supposed to. But if that's not a part of 'transaction-safe', then you're essentially backing up my point.
>> it won't protect you (more specifically - your data) from logic issues
> Nothing protects against logic issues though.
That's a given. If you destructively update your business data with bad logic, the data's gone. You're screwed. You probably won't even know it happened until you get complaints. And once you get complaints, you won't know how many rows in your database were affected, or what their values should be.
With an append-only MQ, you will let a few bad actions happen - logic errors, and consistency errors resulting from too much 'eventual'. You can spot these actions since you didn't delete them. Then you can append the corrections to the MQ.
That is isolation. It's the first time hearing someone call that "transaction safety". In my opinion they are not synonyms. If anything, transaction safety would be something that the database authors have to consider, not the user of a database. A quick google search and chatgpt question agree.
> That is the A in ACID.
Yeah, that is what I assumed you meant when you said "transaction safety".
> If you can say isolation is orthogonal, I can say atomicity is orthogonal.
Yes you can indeed.
> In any case, I was refuting the idea that transactions in a standard ACID database would protect you from, e.g., two message queue workers picking up and processing the same message when they are not supposed to. But if that's not a part of 'transaction-safe', then you're essentially backing up my point.
Then I think it's just a language matter. There is no automatic guarantee against e.g. two message queue workers picking up the same message. There is no guarantees like that in general for transactions and there is fine-granular control with things like "read for update" or "skip locked" for that. I don't think any database ever claimed otherwise.
However, using databases allows the user to opt into these things.
> With an append-only MQ, you will let a few bad actions happen - logic errors, and consistency errors resulting from too much 'eventual'. You can spot these actions since you didn't delete them. Then you can append the corrections to the MQ.
Or, you allowed a user to withdraw cash at two ATMs at the same time, and now their balance is negative even though it must never be negative. You just broke the law and might lose your banking license and go bankrupt.
So yeah, in those cases an append-only MQ simply doesn't help you. You need some system that supports ACID.
> That is isolation. It's the first time hearing someone call that "transaction safety".
I wouldn't call isolation transaction safety. ACID is transaction safety. Isolation is a subset of transaction safety. 'Safe' isn't even my word. I was replying to a comment. Can you honestly say that you'd substitute:
Most users want to use a message queue in a transaction safe way, so they ended up implementing it in database.
with:
Most users want to use a message queue in an [atomic] way [but not necessarily consistent, isolated, or durable - which are orthogonal concerns], so they ended up implementing it in database.
From this point forward I honestly can't follow the line of argument:
> If anything, transaction safety would be something that the database authors have to consider, not the user of a database [a]
> There is no automatic guarantee against e.g. two message queue workers picking up the same message. There is no guarantees like that in general for transactions [b] and there is fine-granular control with things like "read for update" or "skip locked" for that [c]. I don't think any database ever claimed otherwise [d].
So: the user doesn't need to think about their transactions' safety [a], but if they choose to think about it, they can opt-into safety features [c], but those features won't work as advertised [b]. Actually they weren't advertised [d].
> Or, you allowed a user the withdraw cash at two ATMs at the same time, and now their balance is negative even if it must not ever be negative.
Firstly, that's called an overdraft and it's business-as-usual for ATMs. ATMs (and banks in a broader sense) track money using eventual consistency (ledgers & reconciliation), not transactions (in the 'SQL DBMS transaction' sense). [2]
Secondly, why did you pick the two-ATMs-at-the-same-time hypothetical if you're downplaying isolation as a transaction safety concern? That is the purpose of isolation. If your system succeeds at one-thing-at-a-time, but fails with two, isolation is the property which has been violated.
> You just broke the law and might lose your banking license and go bankrupt.
Allowing overdrafts does not break the law. You know what breaks the law in banking? UPDATE statements. Mutations which destructively overwrite banking data. Moving customers' money around by mutating balances in-place. Which is why they don't work that way [2, 3].
> So yeah, in those cases an append-only MQ simply doesn't help you. You need some system that supports ACID.
And here we've come full circle! I can start quoting from the top again:
>> Databases are not as transaction-safe (in their default setup) as one is led to believe. They are ACID for some non-isolated definition of I.
I think we can cut it short. Your OP never claimed that databases automatically apply the desired ACID properties. I think their point was simply that it's possible with them and not without them (without building your own ACID system). No need to discuss it further.
I think you merely interpreted them in a very unfortunate way.
There are Transaction Coordinators that will allow you to span transactions across multiple systems - databases, OR MQ Brokers.
This means you can begin a DB transaction, enlist the MQ GET operation, and then commit or rollback the whole lot. It obviously slows things down a tad, but ensures consistency across all actors.
Ask HN: What message queue software has a native HTTP receiver?
RabbitMQ has one, but it's actually a management interface and I really don't want to expose it to the whole world.
X/Y problem: I need to send data from clients without custom protocols (so no bundled libs) over plain HTTP/S (so it can traverse firewalls in 99.9% of cases) to a public endpoint.
I don't really care about most of the features of message queues, but I definitely don't want to write one.
I had a successful deployment on a very similar task with RabbitMQ, but it was fully internal and had no security requirements.
If you're up for AWS then the SNS-SQS combo works great. If you're not into giving short term credentials to clients then a lambda (with HTTP invocation) should work fine.
For bare servers maybe NATS? Think it had HTTP/S support.