Hacker News | KraftyOne's comments

(DBOS co-founder here) From a DBOS perspective, the biggest differences are that DBOS runs in-process instead of on an external server, and DBOS lets you write workflows as code instead of explicit DAGs. I'm less familiar with Hatchet, but here's a blog post comparing DBOS with Temporal, which also uses external orchestration for durable execution: https://www.dbos.dev/blog/durable-execution-coding-compariso...

> and DBOS lets you write workflows as code instead of explicit DAGs

To clarify, Hatchet supports both DAGs and workflows as code: see https://docs.hatchet.run/home/child-spawning and https://docs.hatchet.run/home/durable-execution


(DBOS co-founder here) DBOS embeds durable execution into your app as a library backed by Postgres, whereas Restate provides durable execution as a service.

In my opinion, this makes DBOS more lightweight and easier to integrate into an existing application. To use DBOS, you just install the library and annotate workflows and steps in your program. DBOS will checkpoint your workflows and steps in Postgres to make them durable and recover them from failures, but will otherwise leave your application alone. By contrast, to use Restate, you need to split out your durable code into a separate worker service and use the Restate server to dispatch events to it. You're essentially outsourcing control flow and event processing to the Restate server, which will require some rearchitecting.
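To make the "checkpoint steps in a database, recover by skipping completed work" idea concrete, here's a minimal self-contained sketch. It uses sqlite3 in place of Postgres and a hand-rolled `step` decorator; the table layout and decorator are illustrative only, not the real DBOS API.

```python
import functools
import json
import sqlite3

# Toy stand-in for library-style durable execution: each completed
# step's output is checkpointed, keyed by (workflow ID, step name),
# so re-running the workflow after a crash skips finished steps.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE steps (wf_id TEXT, step TEXT, output TEXT, "
    "PRIMARY KEY (wf_id, step))"
)

def step(func):
    @functools.wraps(func)
    def wrapper(wf_id, *args, **kwargs):
        row = db.execute(
            "SELECT output FROM steps WHERE wf_id=? AND step=?",
            (wf_id, func.__name__),
        ).fetchone()
        if row:  # already checkpointed: return the recorded result
            return json.loads(row[0])
        result = func(*args, **kwargs)
        db.execute(
            "INSERT INTO steps VALUES (?, ?, ?)",
            (wf_id, func.__name__, json.dumps(result)),
        )
        db.commit()
        return result
    return wrapper

calls = []

@step
def charge_card():
    calls.append("charge")  # pretend external side effect
    return {"status": "charged"}

def payment_workflow(wf_id):
    return charge_card(wf_id)

payment_workflow("wf-1")
payment_workflow("wf-1")  # simulated recovery re-run: step is skipped
```

Note that the application keeps running in-process; the database row is the only extra state, which is the contrast being drawn with a separate orchestration server.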

Here's a blog post with more detail comparing DBOS and Temporal, whose model is similar to (but not the same as!) Restate: https://www.dbos.dev/blog/durable-execution-coding-compariso...


Agreed, if you can do something completely synchronously while responding to an HTTP request, you should.

But often you can't! Then, durable execution helps you manage the complexity of async processing.


DBOS solves this problem by storing state in Postgres, which is really good at coordinating multiple copies of the same service. Essentially, Postgres does the hard parts of external orchestration, letting you work with a simple library abstraction.
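As a sketch of what "Postgres does the hard parts" means here: two copies of the same service can race to recover a pending workflow, and a single atomic UPDATE decides the winner. This is a self-contained illustration using sqlite3 and made-up table/column names, not DBOS internals.

```python
import sqlite3

# Two service replicas share one database. An atomic conditional
# UPDATE acts as the lock: the database serializes the race, so no
# external orchestrator is needed to pick a single recoverer.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE workflows (id TEXT PRIMARY KEY, status TEXT, owner TEXT)")
db.execute("INSERT INTO workflows VALUES ('wf-1', 'PENDING', NULL)")
db.commit()

def try_claim(worker):
    # Only one worker's UPDATE matches the PENDING row.
    cur = db.execute(
        "UPDATE workflows SET status='RUNNING', owner=? "
        "WHERE id='wf-1' AND status='PENDING'",
        (worker,),
    )
    db.commit()
    return cur.rowcount == 1  # True only for the worker that won

won_a = try_claim("worker-a")
won_b = try_claim("worker-b")  # loses: row is no longer PENDING
```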


This is a great answer, and yes, those are critical aspects of durable execution. Maybe I should write a follow-on post that goes into more detail...


I'd love to read it. Getting exactly once semantics is quite an interesting topic.


Thanks! DBOS is simpler not because it ignores complexity, but because it uses Postgres to deal with complexity. And Postgres is a very powerful tool for building reliable systems!


Temporal has the option of using postgres as the persistence backend. Presumably, the simplicity of DBOS comes from not having to spin up a webserver and workflow engine to orchestrate the functions?


Have you had scalability issues because your tables got too big?

Is there a mechanism to GC workflows that are completed?


Tables getting too big hasn't been a concern in practice because information on completed workflows can easily be GC'ed.


Yes, that's totally fair. Usually, a step is a meaningful unit of work, such as an API call that performs an external state modification. Because each step is a fair chunk of work, and the overhead is just one write per step, this scales well in practice--as well as Postgres scales, up to 10K+ operations/second.


I feel like you can generalise this to any transactional key value system, which can scale better.


That's exactly what this model is! The @Step decorator is for external state modifications. Then @Workflows orchestrate steps. The example shows the simplest possible external state modification--a print to the terminal.

Steps can be tried multiple times (if a failure happens mid-step) but never re-execute once complete. Since idempotency can't be added externally, that's the strongest possible guarantee any orchestration system can give you (and if your step is performing an idempotent operation, which is the safest thing, you can use the workflow ID as an idempotency key). More details in the docs: https://docs.dbos.dev/python/tutorials/workflow-tutorial#rel...
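The "tried multiple times, but never re-executed once complete" guarantee can be sketched in a few lines. This is a hypothetical illustration (a dict stands in for the Postgres checkpoint table; the helper names are made up), not the DBOS implementation.

```python
# A step may run more than once if it fails partway through, but once
# its result is checkpointed it is never executed again.
checkpoints = {}
attempts = {"send_email": 0}

def run_step(name, func, max_retries=3):
    if name in checkpoints:  # completed earlier: return recorded result
        return checkpoints[name]
    for _ in range(max_retries):
        try:
            result = func()
        except RuntimeError:
            continue  # failure mid-step: safe to retry the whole step
        checkpoints[name] = result  # checkpoint exactly once, on success
        return result
    raise RuntimeError(f"step {name} exhausted retries")

def send_email():
    attempts["send_email"] += 1
    if attempts["send_email"] == 1:
        raise RuntimeError("network blip")  # simulated mid-step failure
    return "sent"

first = run_step("send_email", send_email)   # tried twice, completes
second = run_step("send_email", send_email)  # no re-execution
```

This is also why making the step itself idempotent (e.g., passing the workflow ID as an idempotency key to the external API) is the safest pattern: the retry between the failed first attempt and the successful second one is then harmless.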


You're forcing adopters to divide any state-impactful activity into its own function (because only functions can be decorated with step, no?). That's seriously inelegant when scaled to larger codebases.

Regional tagging (e.g. safe/unsafe) would be a better approach, as it would allow developers to more naturally protect code, without redefining its structure to suit your library.

You start to grok the problem here, but primarily think about it in terms of databases, which are just one (admittedly common) type of external state:

>> If you need to perform a non-deterministic operation like accessing the database, calling a third-party API, generating a random number, or getting the local time, you shouldn't do it directly in a workflow function. Instead, you should do all database operations in transactions and all other non-deterministic operations in steps.

Note: Think you should really change "all" into "each in a separate transaction/step" there, to communicate what you're recommending?
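The "each non-deterministic operation in its own step" reading can be sketched as follows: the first execution journals each value, and a crash-and-replay reads the journal back, so the workflow sees identical results both times. A self-contained toy (the `step` helper and journal are illustrative, not the real API):

```python
import random
import time

# Each non-deterministic call gets its own named step. The first run
# records the value; a replay returns the recorded value instead of
# re-executing, keeping the workflow deterministic.
journal = {}

def step(name, func):
    if name not in journal:
        journal[name] = func()
    return journal[name]

def workflow():
    started_at = step("now", time.time)              # one step per call
    lottery = step("pick", lambda: random.random())  # separate step
    return (started_at, lottery)

first = workflow()
replay = workflow()  # simulated crash-and-replay: identical values
```

If both calls were lumped into one step, a crash between them would force the already-performed first call to run again on retry, which is the hazard the "each in a separate step" phrasing guards against.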

As a thought exercise: imagine a Python program that automates a third party application via the GUI. Some UI actions cannot be undone (e.g. submit). Some are repeatable without consequence (e.g. navigating between screens).

How would your framework support that?

Because if you can efficiently support the pathological leaky-state case, you can trivially support all simpler cases.


Yeah, this definitely requires splitting state-impactful activity into its own function. That's good practice anyways, though I understand it might be a pain in large codebases. Regional tags are definitely an interesting alternative!

For the UI example, I don't think you'd use durable execution for most of the UI--it's just not needed. But maybe there's one button that launches a complex asynchronous background task, and you'd use durable execution for that (with careful workflow ID management to ensure idempotency and allow you to retrieve the status of the background task).


> splitting state-impactful activity into its own function.

Each separate state-impactful activity into its own function, no?

> For the UI example...

I wasn't talking about building a GUI in a python program: I was talking about using python to automate an external GUI app.

Because that toy example (which also happens to be done in the real world!) encapsulates a lot of the state issues that don't seem well-handled by the current design.

Namely, that external interactions are a mix between stateless and stateful operations.


That's really interesting! It does seem that this is semantically identical to the library approach (as the logic your interpreter adds around steps could also be added by decorators) but is completely automatic. Which is great if the interpreter always does the right thing, but problematic/overly magical if the interpreter doesn't. For example, if your problem domain has two blocking operations that really form one single step and should be retried together, a library approach lets you express that but an interpreted approach might get it wrong.


Generally we recommend against retroaction--the assumed model is that every workflow finishes on the code version it started. This is managed automatically in our hosted version (DBOS Cloud) and there's an API for self-hosting: https://docs.dbos.dev/typescript/tutorials/development/self-...

That said, we know sometimes you have to do surgery on a long-running workflow, and we're looking at adding better tooling for it. It's completely doable because all the state is stored in Postgres tables (https://docs.dbos.dev/explanations/system-tables).

