Ever since reading Enterprise Integration Patterns [1] a few years ago, I'd been under the impression that message queues are the holy grail for decoupling systems that you want to pull apart, but this offers a really nice perspective on where they might not be a panacea. Thank you!
There's a couple of nice ideas in here (the "message pump" for one at least) that I'm going to steal and use at work. It's also comforting to know that we're not completely crazy for using a DB to store tasks/processes that need to be run and retried if necessary instead of a message queue.
I know what you meant by saying you use a DB _instead_ of a message queue, but I think it points to a broader point - a message queue is a concept, and the technology used to implement it is just an implementation detail. It just turns out that databases have all kinds of useful and powerful operations that make implementing message queues relatively easy.
If you don't need high throughput and otherwise don't have a reason to add a separate technology to the stack just to have a queue, SQL can make a great, simple and reliable queue!
* PostgreSQL and Oracle with SKIP LOCKED
* MS SQL Server with READPAST
* DB2 with SKIP LOCKED DATA
If you use MySQL, it's going to be a bit more difficult.
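For the databases that do support skipping locked rows, the claim query is short. A minimal sketch on Postgres - the `jobs` table, its columns, the connection string, and `do_work()` are all made up for illustration:

```python
import psycopg2

conn = psycopg2.connect("dbname=app")  # hypothetical connection string

with conn:  # psycopg2: commits on success, rolls back on exception
    with conn.cursor() as cur:
        # Claim the next pending job. SKIP LOCKED makes concurrent
        # workers skip rows another worker has already locked,
        # so they never block on each other.
        cur.execute("""
            SELECT id, payload
            FROM jobs
            WHERE status = 'pending'
            ORDER BY id
            LIMIT 1
            FOR UPDATE SKIP LOCKED
        """)
        row = cur.fetchone()
        if row is not None:
            job_id, payload = row
            do_work(payload)  # hypothetical; runs inside the same transaction
            cur.execute(
                "UPDATE jobs SET status = 'done' WHERE id = %s", (job_id,)
            )
```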
Using the database as a queue is an anti-pattern. There used to be an entire blog series about why this is the case. However, I can't find it at the moment.
With MySQL, there's a fairly simple trick you can do to build a queue: set a temp var to NULL, run an UPDATE with a SELECT subquery that updates the state of the next row(s) while capturing their ids, then SELECT that var to get your queue rows' PKs.
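If I understand the trick right, a single-row version looks something like this. Table and column names are invented; the `id = (SELECT @claimed_id := id)` assignment is the part that smuggles the PK out of the UPDATE:

```python
import mysql.connector  # assumes mysql-connector-python

conn = mysql.connector.connect(database="app")  # hypothetical connection
cur = conn.cursor()

cur.execute("SET @claimed_id := NULL")
# The no-op assignment `id = (SELECT @claimed_id := id)` captures the PK
# of the row being updated, since MySQL's UPDATE has no RETURNING clause.
cur.execute("""
    UPDATE jobs
    SET status = 'claimed',
        id = (SELECT @claimed_id := id)
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 1
""")
cur.execute("SELECT @claimed_id")
(job_id,) = cur.fetchone()  # NULL/None if no pending row was found
conn.commit()
```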
Having been on the receiving end of more than a few apps where the devs used a database as a queue, I say "please don't".
Yes, Oracle's AQ product is built on top of an RDBMS. And yes, early versions of Microsoft's MSMQ used SQL Server. But most dev teams use the database as a shortcut and don't invest the effort in making their hacktastic DB queue work like a real product.
I have used DB as queues very successfully in multiple projects (as well as "real" queues in others), so your "please don't" is not very persuasive. Some actual argument, instead of content-free "hacktastic" vs. "real product" designations would be appreciated.
And you use "shortcut" as if it were some kind of negative attribute. Of course they use the DB as a shortcut, and they should - it works well and even has additional features (transactions with your primary store) that you wouldn't get with a separate queue product.
Monitoring and administration are more difficult until you write them yourself, especially if you are splitting monolithic code. Many of the message queue products you download include these. Can your DB message queue also support different queue concepts and message routing?
In my case it supported multiple queues (also, a concept of hierarchical queues), but no message routing. As for administration and monitoring - you are correct, you don't get it out of the box. Well, you do get some basic administration for free from whoever administers your database :)
On the other hand, you can grow a solution for those organically over time to fit your needs.
In a previous life I was on a team that moved an ecommerce backend's Email Service Provider (ESP) integration from a Postgres-based queue to RabbitMQ. The motivation was voiced in language similar to yours; "hacktastic" was actually used more than once as a reason to replace the DB with RabbitMQ.
In the end, I'd say they were equally reliable, but from a data analyst's perspective I very much missed the Postgres-based queue. I think the mechanics inherent to a database solution prompted the original implementers to keep messages around over time, vs. the mechanics inherent to an MQ solution, where they became ephemeral (on the sender's side). Having access to those messages was a treasure trove for troubleshooting and for analytics. That, and from a pure cost/benefit stance, sending 10-20k messages a day to the ESP definitely didn't necessitate the expensive rearchitecture, as the DB solution was more than capable of handling the load on cheap hardware.
The reason to use message queues rather than an RDBMS for queue behaviors isn't that queues do stuff databases don't, but rather connection cost. There's a lot of complicated handling around the database connection, whereas queue connections tend to be very inexpensive.
As an alternate anecdote, I have had far more issues with MQ connections than with RDBMS connections. I don't think there is a clear advantage to messaging from that standpoint.
Do you mean development/complexity cost or performance cost? If it's the former, then I don't really see it. If it's the latter, then yes - an RDBMS-backed queue is probably going to hit scaling limits earlier than a dedicated product.
You mean now you CAN monitor the queue... it's been very helpful for us to just throw alerts on queue size in AWS Simple Queue Service... but we have some apps keeping state in Mongo, and you have to write a custom script each time to query it and send the data to CloudWatch.
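For what it's worth, the custom script doesn't have to be much. A sketch with pymongo and boto3 - the database, collection, field, and metric names are all made up:

```python
import boto3
from pymongo import MongoClient

# Count the items still waiting in the Mongo-backed "queue".
pending = MongoClient().myapp.tasks.count_documents({"status": "pending"})

# Push it as a custom CloudWatch metric, so the same alarm setup used
# for SQS queue depth works here too.
boto3.client("cloudwatch").put_metric_data(
    Namespace="MyApp/Queues",
    MetricData=[{
        "MetricName": "PendingTasks",
        "Value": float(pending),
        "Unit": "Count",
    }],
)
```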
Using message queues to decouple components suffers from the same problem as microservices: you start to get a lot of implicit dependencies that you must document.
Beyond that, you often realize you haven't made the logical split of components in quite the right place, or the "right place" for that split changes over time. Then you've got the fun task of moving functionality from one component to another, which is extremely expensive in development time.
(or, more often than not, because of the cost of doing this, you end up putting up with a slightly batshit-insane design...)
Or in my experience, Service A sends a message to Service B, which sends a message to Service C, which sends a message to Services D, E, and F, and Service F sends another message to Service C, which this time it sends a message to Service G (so at least it's not completely circular), which then hits the database and returns information back up the chain.
I'm exaggerating a little bit, but not too much (actually, on further reflection, I might be downplaying it a bit: some of our services schedule tasks with like 10 different services for every item processed, and we do tens of thousands a day).
Debugging issues in this mess is not fun, because there's so many places you need to check to see if it's the source of the failure or not, and a failure in one service could really be in a different service so you have to test all the way up and down the chain. For every bug.
But I was under the impression that the point of message brokering is that it doesn't have to be service A that puts it there; only that some service puts the message there.
I feel like the article is attacking message brokering by discussing the disadvantages of bad use cases for them. A good use case for message brokers is when work needs to be done on an item, but not immediately.
My company uses them in a way I believe is quite effective. We pull in data from an external source, and send the ID of the item to about five different queues to do different tasks. Each time one of the queues finishes the work, it sends a message to a validator that checks to see if all the work is done. If it isn't, it waits to get that message again from another worker. If it is, it marks that item as ready for the end user.
These are the kinds of use cases I think message brokers should be used for. Not to send a message and wait to get an answer back. Why not just use an HTTP request for that?
> But I was under the impression that the point of message brokering is that it doesn't have to be service A that puts it there; only that some service puts the message there.
It doesn't matter, but it's more about debugging. If Service A does not work correctly, where is the bug? Is it Service A? Service B? The network?
The use at your company is idiomatic to the paradigm. You have n different units of work that can run separately, so you do that and communicate with messages.
If you are using messages correctly, it shouldn't be difficult to debug. You have an input and output for each service and you see where something happens differently than expected. I'm not sure where you are saying the difficulty comes from.
That matches my experience — I've seen a fairly common learning process where someone adopts a message queue, decides it's great and uses it all over the place, and then spends a while working through the various failures which didn't happen in their development/testing environment, so they're making decisions about how to handle dropped messages, duplicates, etc. in a rush.
It's not that hard to do, but it seems to take people by surprise, and it's not helped by some poor defaults, like almost everything in the RabbitMQ ecosystem silently blocking when the queue fills up. (That probably happened transiently multiple times before it hit the level where it caused a visible outage, but how many people will notice if the default isn't to log or raise an error?)
I've been working in a related space for a few years and wanted to offer a few counterpoints to the article.
First, if you're doing request/response using messaging, you're probably doing it wrong. Pub/sub and request/response are totally different animals. I, for one, consider it both reasonable and necessary to use both side-by-side, in the same infrastructure. (Is this view uncommon?)
In our technology stack, which is a monolith-becoming-microservices, we use both pub/sub and request/response side-by-side. The general rule: when service A calls service B, if the nature of that interaction is such that service B's response can preempt/interrupt/influence service A, the call needs to be made inline, in service A. If the interaction is more "advisory", use pub/sub.
Examples (from the hotel booking space):
(a) When a reservation gets canceled, we publish a cancellation event. The reservation is then CANCELED, officially. A separate service sees the cancellation event and frees the associated held room inventory; that's a pub/sub interaction.
(b) When a reservation wants to check in to a room, we check whether the room is already occupied. This has to be done using request/response (in our case, gRPC) because if the room is occupied, that's a hard gate on the success of the checkin.
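A rough sketch of the two shapes, with pika standing in for the pub/sub leg and a plain HTTP call standing in for our gRPC check (all names here are invented):

```python
import json
import pika
import requests

# (a) Advisory: publish the cancellation event and move on. Whoever
# cares about freed inventory subscribes; we don't wait for them.
conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()
ch.exchange_declare(exchange="reservations", exchange_type="topic")
ch.basic_publish(
    exchange="reservations",
    routing_key="reservation.canceled",
    body=json.dumps({"reservation_id": "R123"}),
)

# (b) Gating: the answer decides whether check-in proceeds, so the
# call has to be inline and synchronous.
resp = requests.get("http://rooms-service/rooms/401/occupancy")
if resp.json()["occupied"]:
    raise RuntimeError("Room occupied; check-in blocked")
```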
Second, pub/sub != work queues.
Pub/sub is about distributing small bits of information all over the system and letting things be advised of stuff. Using a messaging system as a work queue is overall pretty stupid. I know it's common to use a message broker like RabbitMQ for task distribution, but it's silly. It's really silly when the tasks themselves contain huge binary objects inline, as part of the message payload. Store that shit in S3 or a proper system, and keep the message payloads light.
A pub/sub relationship should convey metadata of state change. Any state change in the system should be communicated via pub/sub.
A subscriber might need to use a request/response interaction with some other microservice in order to act on a pub/sub message, but that's not a state change; that's just auxiliary data it needs to do the job the state change triggered.
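Concretely, the "keep payloads light" point is usually called the claim-check pattern: park the blob, publish a pointer plus the state-change metadata. A sketch - the bucket, exchange, and routing key are invented, and the exchange is assumed to already exist:

```python
import json
import uuid
import boto3
import pika

big_binary_blob = b"\x00" * (10 * 1024 * 1024)  # stand-in for a huge artifact

# Park the heavy payload in S3...
key = f"artifacts/{uuid.uuid4()}.bin"
boto3.client("s3").put_object(Bucket="my-app-blobs", Key=key, Body=big_binary_blob)

# ...and publish only lightweight state-change metadata pointing at it.
ch = pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel()
ch.basic_publish(
    exchange="events",
    routing_key="artifact.ready",
    body=json.dumps({"blob": f"s3://my-app-blobs/{key}"}),
)
```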
> Pub/sub and request/response are totally different animals. [...] (Is this view uncommon?)
One model is synchronous, the other is asynchronous. I don't see why anybody would have any doubts.
> Second, pub/sub != work queues.
The publish/subscribe model doesn't say anything about preserving and acknowledging messages, and with work queues usually only one worker takes the queued job. I don't see why anybody would mistake one for the other.
This article is not just very well written, it's also very funny. Some gold:
> a message broker is a service that transforms network errors and machine failures into filled disks
...
> mark it as required in the database, and wait for something else to handle it.
>
> Assuming that something else isn’t a human who has been paged.
...
> Systems grow by pushing responsibilities to the edges
...
> A distributed system is something you can draw on a whiteboard pretty quickly, but it’ll take hours to explain how all the pieces interact.
I love it when you can mix technical content with not taking yourself too seriously.
Although be careful with using a DB as a task queue. Concurrency is a b* and message brokers are very good at it. AMQP was created because the authors started with a DB-backed message broker and it didn't work.
A task queue is message broker + persistence + status. Celery does that very well in the Python world, and works with RabbitMQ, Redis, Postgres, etc.
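A minimal sketch of that combination - broker/backend URLs and the task body are placeholders:

```python
from celery import Celery

# Redis as the broker, Postgres (via SQLAlchemy) as the result backend:
# persistence + status layered on top of plain message passing.
app = Celery(
    "tasks",
    broker="redis://localhost:6379/0",
    backend="db+postgresql://localhost/celery_results",
)

@app.task(bind=True, max_retries=3)
def send_welcome_email(self, address):
    try:
        deliver(address)  # hypothetical mail-sending call
    except IOError as exc:
        # Retry with a delay; task state stays queryable in the backend.
        raise self.retry(exc=exc, countdown=60)
```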
What amazes me the most is Autobahn + Crossbar.io. It does pub/sub, RPC, load balancing and all that stuff for Python, JS, PHP, C#, Java... And it even works in the browser. Cool stuff.
I have to say, I spent way too much time trying to understand what b* trees have to do with concurrency, and what system you are using that implements them.
It seems to me that there are great benefits to be had if you can keep the transactional integrity of direct connections to a relational database like PostgreSQL. The performance and scalability are often better than people expect, because the data stays closer to the CPUs and you avoid moving too much across multiple nodes on the relatively slow and unreliable network.
In a lot of cases, there are natural separation lines such as between groups of customers you can use to shard and scale things up. Unless you are building something like a social network where everything is connected to everything, you don't need a database that runs over a large cluster or clustered queues in between components. These are often just more moving parts that can break.
The benefits of transactional queues in your database are hard to overstate; commit the result of the task in the same transaction as you commit the queue update. Don't worry about idempotency, lost messages, or duplicate messages.
I suspect the advice to avoid it because of performance has become invalid for all but extreme use-cases. My company has dozens of high activity 1-100M item queues in single postgresql databases. It works great.
If you segment your databases into microservices, I think you lose transactions. Say two services talk via a DB queue. You'll need to store the DB for each service on the same database server, in different schemas. If you ever want to move the DB for one service, you have to introduce distributed transactions. That, or drop them.
A queue is not the holy grail either, especially if you want to put something on the queue during a DB transaction. Be aware that once you do I/O (some RPC or queue interaction) inside the transaction, you lose the ACID guarantees in the DB, and you make the system much less reliable this way.
The way to do this is to publish to the queue _after_ the transaction commits, while also making sure the action won't be lost in the process.
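The usual name for "publish after commit without losing it" is the transactional outbox: write the pending event into the DB in the same transaction, and have a relay publish it afterwards. A sketch - the table names and the publish() helper are invented:

```python
import json
import psycopg2

conn = psycopg2.connect("dbname=app")  # hypothetical connection

# 1) Record the business change AND the pending publication atomically.
with conn, conn.cursor() as cur:
    cur.execute("UPDATE orders SET status = 'paid' WHERE id = %s", (42,))
    cur.execute(
        "INSERT INTO outbox (topic, body) VALUES (%s, %s)",
        ("order.paid", json.dumps({"order_id": 42})),
    )

# 2) A separate relay publishes committed outbox rows, then deletes them.
#    At-least-once: a crash between publish and delete means a duplicate,
#    so consumers still have to tolerate redelivery.
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT id, topic, body FROM outbox ORDER BY id LIMIT 10 FOR UPDATE SKIP LOCKED"
    )
    for row_id, topic, body in cur.fetchall():
        publish(topic, body)  # hypothetical broker call
        cur.execute("DELETE FROM outbox WHERE id = %s", (row_id,))
```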
The term SOA is always mentioned with the underlying connotation that a client requests a resource and somehow there is an extra step of "service discovery", which has its own set of problems that branch out into various fields of physics and mathematics.
When I see such branching complexity, I often think that the architecture is somehow backwards, and try a simple reversal of responsibilities. In this case, the services would be looking for a job to complete, in effect turning the aforementioned step into "job discovery", which is indeed the kind of architecture I've been applying for the past decade.
Seems to be working well so far. Back pressure is handled at the front gate, since none of the services picks up the job; involuntary synchronization is still a problem, but avoidable by cleverly re-ordering the job queue. Job completion is communicated back to the front gate through pub/sub, and the anecdotal evidence so far has been great.
Dealing with fairly low-volume stuff, my concern about using the database alone has to do with who owns what schema and how changes are managed.
Being able to send off a rich asynchronous message is nice because it means you do not need to have some co-owned table in a shared database that two different components are reading and writing from.
Or, worse, a widespread pattern of every service exposing a piece of its database to other services with an unsatisfying level of logging or control for what really happens.
Hmm, I submitted this yesterday, but it got lost in the scrum.
It's always interesting to hear real reports from the trenches that aren't essentially ads for technology XYZ. I'm afraid that all too often we make things more complicated for bad reasons, whether ignorance, chasing the latest trend, or resume-driven development...
Message-oriented stuff has been around a long time, so I don't think it's a fad.
It's basically taking the concept of "integration" (or API) itself and creating a product for it, just as a database is a product for the concept of persistence. Thus just as not every application has to reinvent a database, with a messaging product not every product has to reinvent queueing up integration calls if the target system isn't available.
The article also mentions request-reply being "what you really want". I think that this is a) not true, as a lot of the time you can fire and forget, and b) when you need it, the products generally provide a request-reply API on top of their lower-level APIs. No need to reinvent.
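For RabbitMQ, that request-reply layering is conventionally a private reply queue plus a correlation id; roughly like this, following the standard pika pattern (the request queue name is invented):

```python
import uuid
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

# Exclusive, server-named queue to receive the reply on.
reply_queue = ch.queue_declare(queue="", exclusive=True).method.queue
corr_id = str(uuid.uuid4())

ch.basic_publish(
    exchange="",
    routing_key="rpc_requests",  # the responder consumes this queue
    properties=pika.BasicProperties(reply_to=reply_queue, correlation_id=corr_id),
    body=b"is room 401 occupied?",
)

# Block until the reply matching our correlation id shows up.
for method, props, body in ch.consume(reply_queue):
    if props.correlation_id == corr_id:
        print(body)
        break
```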
I got the impression that the author didn't really want a message-based system at all, but rather a request-response system that they tried to implement within a message broker. Of course that's going to create more headaches than it solves; it's the wrong solution.
Both message brokers and request-response code have their places in distributed systems, but they really need to learn when each is appropriate.
I agree, saying that request-reply is "what you really want" was kind of silly, especially after the opening paragraph that states "it depends".
In my eyes (where the GP makes perfect sense), a message broker is asynchronous; there's no implicit wait while your consumers work on your request. A request-response interface will stop the producer until the consumer is done.
[1] https://en.wikipedia.org/wiki/Enterprise_Integration_Pattern...