Do queue operations (enqueue a job & mark this job as complete) happen in the same transaction as my business logic?
Imo that's the killer feature of database-based queues, because it dramatically simplifies reasoning about retries, i.e. "did my endpoint logic _and_ my background job enqueue both atomically commit, or atomically fail?"
Same thing for performing jobs: if my worker's business logic commits, but the job later retries (b/c marking the job as complete is a separate transaction), then oof, that's annoying. And I might as well be using SQS at that point.
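Roughly what I'm hoping for, as a minimal node-pg sketch (the orders/jobs tables and the job shape are just illustrative, not any particular library's API):

    import { Pool } from 'pg';

    const pool = new Pool();

    // Business write + job enqueue in ONE transaction: both commit or both roll back.
    async function placeOrder(customerId: number, amountCents: number) {
      const client = await pool.connect();
      try {
        await client.query('BEGIN');
        const { rows } = await client.query(
          'INSERT INTO orders (customer_id, amount_cents) VALUES ($1, $2) RETURNING id',
          [customerId, amountCents],
        );
        await client.query(
          'INSERT INTO jobs (type, payload) VALUES ($1, $2)',
          ['send_receipt', JSON.stringify({ orderId: rows[0].id })],
        );
        await client.query('COMMIT'); // endpoint logic and enqueue are atomic
      } catch (err) {
        await client.query('ROLLBACK');
        throw err;
      } finally {
        client.release();
      }
    }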
My understanding is that Hatchet isn’t just a queue, it’s a workflow orchestrator: you can use it as a queue, but it’s kind of like using a computer as a calculator: it works, but it’d likely be simpler to just use a calculator.
On your point of using transactions for idempotency: you’re right that it’s a great advantage of a db-based queue, but I’d be wary about taking it as a holy grail for a few reasons:
- it locks you into using a db-based queue. If for any reason you don’t want to anymore (e.g. you’re hitting scalability limits), it’ll be very difficult to switch to another queue system, since you’re relying on transactions for idempotency.
- you only get transactional idempotency for db operations. Any other side effect won’t be automatically idempotent: external API calls, sending messages to other queues, writing files…
- if you decide to move some of your domain to another service, you lose transactional idempotency (it’s now two databases)
- relying on transactionality means you’re not resilient to duplicate tasks in the queue (duplicate publishing). That can easily happen: a bug in the publisher, two users triggering an action concurrently… it’s quite often a perfectly normal thing for the same action to be triggered multiple times
So I’d avoid having my tasks rely on transactionality for idempotency; your system is much more resilient if you don’t
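What I'd reach for instead is task-level idempotency, e.g. an idempotency key checked against a unique constraint. A minimal sketch (the processed_tasks table and the task shape are made up):

    import { Pool } from 'pg';

    const pool = new Pool();

    // Dedupe on an idempotency key instead of leaning on queue transactionality,
    // so duplicate publishes and redeliveries are harmless.
    async function handleTask(task: { idempotencyKey: string; payload: unknown }) {
      const result = await pool.query(
        'INSERT INTO processed_tasks (idempotency_key) VALUES ($1) ON CONFLICT DO NOTHING',
        [task.idempotencyKey],
      );
      if (result.rowCount === 0) return; // already handled this task, skip
      // ... do the actual work; external side effects should also key off idempotencyKey ...
    }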
They’re not talking about distributed transactions: it’s not about a task being published and consumed atomically, it’s about it being consumed and executed atomically.
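Concretely, the consume-and-execute-atomically pattern looks something like this (a sketch; the jobs table and its columns are illustrative):

    import { Pool } from 'pg';

    const pool = new Pool();

    // Claim a job, run the business logic, and mark it complete in ONE
    // transaction, so "executed" and "marked done" can't diverge.
    async function workOneJob() {
      const client = await pool.connect();
      try {
        await client.query('BEGIN');
        const { rows } = await client.query(
          `SELECT id, type, payload FROM jobs
           WHERE completed_at IS NULL
           ORDER BY id
           LIMIT 1
           FOR UPDATE SKIP LOCKED`,
        );
        if (rows.length > 0) {
          // ... business logic writes happen here, on the same client/transaction ...
          await client.query('UPDATE jobs SET completed_at = now() WHERE id = $1', [rows[0].id]);
        }
        await client.query('COMMIT');
      } catch (err) {
        await client.query('ROLLBACK');
        throw err;
      } finally {
        client.release();
      }
    }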
Working on v2 of our n+1-proof/reactive TypeScript ORM, Joist (https://joist-orm.io/), which moves to the new-ish postgres.js driver (instead of knex/node-pg), so that we can leverage postgres.js's statement pipelining within transactions.
I'm anticipating a really sweet perf increase (as shown by some proof-of-concepts), but now that everything is actually working on the v2 branch, I'm putting together benchmarks that show the benefit in practice.
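For anyone curious what pipelining buys, a rough sketch (illustrative table names, and not Joist's internals): iirc postgres.js lets you return an array of queries from the `begin` callback, and sends them without waiting on a round trip between each one.

    import postgres from 'postgres';

    const sql = postgres(process.env.DATABASE_URL!);

    // Both statements go out back-to-back on the same connection, instead of
    // await-query, round trip, await-query, round trip.
    const [authors, books] = await sql.begin((sql) => [
      sql`select * from authors`,
      sql`select * from books`,
    ]);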
Love to have anyone poke around/ask questions/hang out on discord.
Same! We've also been using the polyfill in production for a ~year or so.
We recently configured the node-pg driver to leave the postgres DATE / TIMESTAMP / TIMESTAMPTZ dates as "just strings" (instead of its default parsing to JS dates), and then we go straight from strings -> temporals at our ORM layer, without the jank/hop of it first being a JS Date.
But it's been great, having `DATE` in the db => `PlainDate` in the TS backend => `Date` in GraphQL API => `PlainDate` again in the TS/React FE. Very clean.
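The node-pg side of that is just a couple of type parser overrides; roughly (the string -> Temporal mapping below is a simplified sketch of what our ORM layer does):

    import { types } from 'pg';
    import { Temporal } from '@js-temporal/polyfill';

    // Leave DATE / TIMESTAMP / TIMESTAMPTZ as the raw wire strings instead of
    // node-pg's default parse into JS Dates.
    types.setTypeParser(types.builtins.DATE, (value) => value);
    types.setTypeParser(types.builtins.TIMESTAMP, (value) => value);
    types.setTypeParser(types.builtins.TIMESTAMPTZ, (value) => value);

    // Then map string -> Temporal at the ORM layer, e.g. '2024-01-31' for a DATE column.
    const toPlainDate = (value: string) => Temporal.PlainDate.from(value);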
I'm building "ActiveRecord but with type(script)" at https://joist-orm.io/ and have our own internal "kinda Rails w/GraphQL" going internally--so far we're at ~450 tables and still "feeling good" (no n+1s ever!), but definitely a long way to go before AirBNB-sized problems.
Using raw SQL directly is doable, but it means you're responsible for maintaining the business logic & validation rules of every single hand-written INSERT, UPDATE, and DELETE query in your codebase.
Personally I don't trust myself to remember to do that :-) hence preferring entity-based ORMs:
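E.g. the shape I want, as a hand-rolled sketch (just to show the idea, not Joist's actual API):

    // Centralize validation on the entity so every save path runs it, instead
    // of hoping each hand-written INSERT/UPDATE remembers to re-check it.
    interface Entity {
      validate(): string[];
    }

    class Author implements Entity {
      constructor(public firstName: string) {}
      validate(): string[] {
        return this.firstName.trim() === '' ? ['firstName is required'] : [];
      }
    }

    async function flush(entities: Entity[], save: (e: Entity) => Promise<void>) {
      const errors = entities.flatMap((e) => e.validate());
      if (errors.length > 0) throw new Error(errors.join('; '));
      for (const e of entities) await save(e); // every write funnels through here
    }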
(That said, I definitely "know SQL" and use raw SQL queries for the ~5% of queries in a CRUD/SaaS app that are actually unique/non-boilerplate, instead of forcing them to go through an obtuse query builder DSL.)
I mean, of course they are a good idea, what we need is more examples of actually doing them in practice. :-)
I.e. quoting from the post:
- monolithic databases need to be broken up
- Tables must be grouped by module and isolated from other modules
- Tables must then be migrated to separate schemas
- I am not aware of any tools that help detect such boundaries
Exactly.
For as much press as "modular monoliths" have gotten, breaking up a large codebase is cool/fine/whatever--but breaking up a large domain model is imo the "killer app" of modular monoliths, and the piece we're missing (basically a Rails of modular monoliths).
Any "query in a loop", whether a manually-written `for` loop, or an "emergent for loop" from 100 `after_update` lifecycle hooks firing at once, are always auto-batched, every time.
Joist is also likely why we actually like GraphQL, vs. it generally having a reputation for being a pita to implement (N+1s everywhere w/o tedious, manual batching).
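A hand-wavy sketch of the mechanism (not Joist's actual internals): anything requested in the same tick gets collected and flushed as a single query.

    // Collect ids requested in the same tick and load them with one query.
    function batched<T>(loadMany: (ids: number[]) => Promise<Map<number, T>>) {
      let pending: {
        id: number;
        resolve: (v: T | undefined) => void;
        reject: (e: unknown) => void;
      }[] = [];
      return (id: number) =>
        new Promise<T | undefined>((resolve, reject) => {
          pending.push({ id, resolve, reject });
          if (pending.length === 1) {
            queueMicrotask(async () => {
              const batch = pending;
              pending = [];
              try {
                const rows = await loadMany(batch.map((p) => p.id)); // one SELECT ... WHERE id = ANY($1)
                batch.forEach((p) => p.resolve(rows.get(p.id)));
              } catch (err) {
                batch.forEach((p) => p.reject(err));
              }
            });
          }
        });
    }

    // e.g. a "query in a loop" that actually issues a single SELECT:
    // const loadAuthor = batched(loadAuthorsByIds);
    // const authors = await Promise.all(books.map((b) => loadAuthor(b.authorId)));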
Maybe it's Stockholm syndrome after ~4-5 years of TypeScript, but I like knowing "this method call is going to do I/O somewhere" (that it's red).
To the point where I consider "colorless functions" to be a leaky abstraction; i.e. I do a lot of ORM stuff, and "I'll just call author.getBooks().get(0) b/c that is a cheap, in-memory, synchronous collection access ... oh wait, it's actually a colorless SQL call that blocks (sometimes)" imo led to the ~majority of ORM backlash/N+1s/etc.
Maybe my preference for "expressing IO in the type system" means that in another ~4-5 years I'll be a Haskell convert, or using Effect.ts to "fix Promise not being a true monad", but so far I feel like the JS Promise/async/await really is just fine.
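To make the "red means I/O" point concrete, with made-up types (not any specific ORM's API):

    interface Book { title: string }

    interface Author {
      // colorless/lazy style: looks like a cheap in-memory read, but may secretly issue SQL
      getBooks(): Book[];
      // colored style: the Promise in the signature tells you a query may happen here
      loadBooks(): Promise<Book[]>;
    }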
This will return A (author) rows, where each author row has a `_b` column that is a JSON array of that author's books. And if you asked for C comment rows, then one of the "columns"/indices in each of those book arrays will itself be a nested array of the book's comments.
So all the data gets returned, but with no repetition/Cartesian product.
Hat tip to PostGraphile, where I first saw the "cross join lateral" + "json_agg" technique used. Disclaimer: we're not actually running this Joist feature in production yet--just haven't had time to run it through the test suite.
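The generated SQL is roughly this shape (simplified, names illustrative, not Joist's exact output):

    // One row per author, with that author's books (and each book's comments)
    // folded into JSON arrays via lateral joins + json_agg.
    const query = `
      select a.id, a.name, b_agg._b
      from authors a
      cross join lateral (
        select coalesce(json_agg(json_build_object(
          'id', b.id,
          'title', b.title,
          'comments', c_agg._c
        )), '[]'::json) as _b
        from books b
        cross join lateral (
          select coalesce(json_agg(json_build_object('id', c.id, 'text', c.text)), '[]'::json) as _c
          from comments c
          where c.book_id = b.id
        ) c_agg
        where b.author_id = a.id
      ) b_agg
    `;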
I wasn't aware of Joist until I read a comment of yours somewhere else in this comment section, interesting stuff! I think Joist is the closest thing I've seen to an actual fix to this problem. The way you use the JS event loop as your 'scope' is pretty clever, and satisfies most use-cases I think.
It aligns pretty closely to what we do internally. We don't use classes or annotations for entity definitions; instead we'll have something like this:
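(The field names below are made up, but the shape is plain config objects keyed by an entity name, no classes or decorators:)

    const entities = {
      User: {
        columns: { id: 'serial', email: 'text' },
        relations: { role: { references: 'Role' } },
      },
      Role: {
        columns: { id: 'serial', name: 'text' },
      },
    } as const;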
Then the way to query and save is something like this:
    datasource.inTransaction(async (orm) => {
      const users = await orm.findMany('User', { where: sql`...`, orderBy: sql`email asc` });
      // Await in this loop might look bad, but it only blocks for the first user
      for (const user of users) {
        console.log(user.email, (await user.role).name);
      }
      const adminRole = await orm.findOne('Role', 1); // by ID, throws if not found
      return orm.saveMany('User', users.map(u => ({ ...u, role: adminRole })));
    });
Which looks different from what Joist is doing, but the fetching mechanics under the hood are very similar I think. Our 'scope' is just somewhat more explicit, since we don't use the event loop mechanics of JS but rather use the inTransaction callback as the boundary.
(The reason we haven't open sourced this yet is that I'd like to iron out some of the ugly parts first, like the string entity identifiers, and the forced usage of an id column, and the lack of type safe where clauses right now)