Hacker News new | past | comments | ask | show | jobs | submit login
Implementing State Machines in PostgreSQL (felixge.de)
199 points by nreece on July 31, 2017 | hide | past | favorite | 86 comments



Thank you for sharing! My impression is that database systems are increasingly gravitating towards Prolog, with various extensions such as logical rules, more expressive aggregation, state machines, constraints, Turing completeness, ... All these sound very familiar to Prolog programmers.

Only recently, there was a post on GRAKN.AI which seemed heavily inspired by Prolog.

This is good news for Prolog: Modern Prolog systems provide many features that are important in the domain of databases, such as JIT indexing, transactions, and dedicated mechanisms for semantic data.


Can you recommend a good in depth introduction to "productive" (in contrast to a more academic approach) modern prolog for someone very superficially familiar to it (think uni course, long time ago)? I heard that e.g. Constraint Logic Programming functionalities makes some old approaches obsolete, and thus going through old material as starting point is very ineffective.


Please see my profile page: It contains several links to material that I recommend for learning modern Prolog.

You can quite often apply Prolog in actual practice if you know it. For instance, see how often "logic" is mentioned just in the context of the present discussion. It's nice to know a logic programming language that can elegantly express business logic and business rules!


I found it hard to use entities with more than 4 Attributes when defining them as simple compounds like

person(Firstname, Lastname, Gender, Age, State, Country).

Accessing the attributes by position makes the code hard to read because one has to remember the position and mistakes are not catched because there is no type system.

Would you recommend using the libray record or using dicts in swi-prolog? This would make the code non portable. Moreover, do you recommend dedicated strings or list of atoms to represent strings?


Hey! Grakn person here.

The Prolog syntax is very elegant but it could be clunky, I totally agree. Our syntax has resource and role names instead of relying on position:

e.g.

match $x has lastname "some last name";


Cool example. But, doing this will create a tight coupling between business logic , the state machine, and storage/persistence, Postgres.

If ever you decide you want to migrate away from Postgres, you'll need to rewrite all the business logic into some other out of process language. If ever you want to scale this, such that you want to calculate states in batches, you'd want to have the business logic somewhere else...


Author here.

> If ever you decide you want to migrate away from Postgres, you'll need to rewrite all the business logic into some other out of process language.

Yeah, but I don't consider this a bad thing. IMO migrating between databases should always require a lot of rewriting. If it doesn't, you're most likely underutilizing the features provided by your database.

> If ever you want to scale this, such that you want to calculate states in batches, you'd want to have the business logic somewhere else...

I'm not sure I follow. The application I'm using this in has > 1 billion rows and we frequently re-compute the state of all our entities across the entire data set in batches after we make changes to our logic. Having the code that does this in the database avoids having to move large amounts of data between our db and the application.


> IMO migrating between databases should always require a lot of rewriting. If it doesn't, you're most likely underutilizing the features provided by your database.

Very likely true. But when you get locked into Oracle and it makes it nearly impossible to move because of the features your relying on, has a big effect on shaping your perspective on this.

> The application I'm using this in has > 1 billion rows and we frequently re-compute the state of all our entities across the entire data set in batches after we make changes to our logic.

That is very impressive. Mind sharing the rate of change across all of those rows?

Also, I'm curious. Is this all that DB instance does? Or is it responsible for other data as well?


> But when you get locked into Oracle and it makes it nearly impossible to move because of the features your relying on, has a big effect on shaping your perspective on this.

I hear you. Picking a database is major decision and shouldn't be made lightly. That being said, I hear a lot of companies are successfully migrating from Oracle to Postgres these days. And in fact, this application was actually migrated from Couchbase to Postgres, but that's a story for another day perhaps :).

> That is very impressive. Mind sharing the rate of change across all of those rows?

Nowadays its about 5 million inserts / day with peak rates of 150 inserts / sec. Not too crazy, but it adds up over time :).

> Also, I'm curious. Is this all that DB instance does? Or is it responsible for other data as well?

It does a few other things, but this is the main workload.


The only thing worse than paying through the nose for Oracle is paying through the nose for Oracle and treating it like a really expensive version of MySQL/Postgres. If you're paying for it, might as well get some value for your money and use those features.


Well, this is what the SQL standard is for: to reduce the amount of work to be done in migrating between databases.

That said, I much prefer to put all this logic into SQL, and to use PostgreSQL, than the alternatives. There's any number of reasons for this:

- direct access to the DB is not dangerous if the logic needed to keep it consistent... is in the DB

- you don't have to replicate this logic if you add new front-ends

- SQL is very expressive, so you'll end up having less code using SQL than anything else -- this also means that RDBMS migration costs need not be as high as one might think, since there will be less SQL than the equivalent for the front-end

- updating business logic is easier to do atomically


Yeah, the idea of having a SQL standard is great. But in practice a "standard" is somewhat meaningless without rigorous compliance testing, and unfortunately SQL hasn't had this for 20 years [1]. But I'll take SQL over any of poor reinvention of relational algebra any day, so it's still better than the alternatives by far :).

[1] http://www.techrepublic.com/article/is-sql-a-standard-anymor...


I might add that it's also fairly easy these days to unit test in the database itself, especially in pg, so moving logic down is a perfectly acceptable practice.


> If ever you decide you want to migrate away from Postgres[…]

it's one of the things to keep in mind. Depending on your application, migrating away from postgres will be more or less painful. If you're already deeply invested in postgres features, this is just one more problem to solve.

> If ever you want to scale this, such that you want to calculate states in batches, you'd want to have the business logic somewhere else

yes. This has the all the usual issues of putting (some) business logic into the database. On the other hand, by using this, you're basically just creating a data integrity constraint similar to a foreign key, just one not as widely supported.

Still. If you ever plan to move to a database that doesn't support foreign key constraints, you will have to implement the business logic somewhere else.

For me, data integrity is paramount. If ever I can put a data integrity check directly into the database, I will do it because bugs in the application logic can exist and when the database itself enforces integrity constraint, I'm protected from those.

I don't want to have to deal with, to stay in the framework of the article, a shipped, but unpaid order. Was this a bug in the application? Did it actually ship? Did the payment fail?

If the database blows up on any attempts to store invalid data, I'm protected from having to ask these questions.


> If the database blows up on any attempts to store invalid data, I'm protected from having to ask these questions.

I'm always surprised when people fight using constraints. During dev, I see them as a godsend for spotting problems - I don't know how many bugs having self-enforcing data structures has caught.

I will say that I have, under protest, turned them off in production once for performance. A particular heavily-used flow was annoying due to a large number of FK checks on an intermediate step. But never during development, and given the number of FK-violation errors I've seen in production code, preferably not even then.

Fixing code is almost always so much easier than fixing data.


I hear the argument about switching storage layers a lot, and I don't completely disagree with it, but in my 20+ years writing code, I've never found a case where switching storage engines didn't cause a massive rewrite even when the storage layer was used in an agnostic way. I'm sure there are examples where it has worked, but saying it as if it's a maxim just feels wrong to me.


It feels like you can go an entire career without switching databases. And if you still have most of the original dev team, the cost of rewrites are probably not as high as they're made out to be (by telling everyone to plan for switching databases at some undefined point in the future)

Loose coupling has a lot of other advantages but this seems to have become the biggest selling point for a loosely-coupled data layer.


> It feels like you can go an entire career without switching databases.

I wish I could say that. I've done more than I care to count. The issue is usually one of the following:

1) converting between types of storage layers (SQL/NoSQL/flat file). I've never seen an abstraction layer that could handle MongoDB and later be converted to Postgres for example (a real migration I had to do once) and still do justice to either backend.

2) Hopefully you pick a storage layer for a good reason. For example, say you picked Postgres because you have Geospatial needs. Business later dictates you have to use MySQL "for reasons", what abstraction layer is going to help there to not require code refactor?

3) If you are doing things right, there's more than App code that interfaces with the DB. You've got the entire devops chain (backups, automation, etc) that will likely need to be rewritten. This can take an immense amount of work as well.

Anyway, not saying it's not possible, but I've found it a good litmus test of another engineer if they think moving to another storage layer should be simple. It's not always, but still more often than not shows inexperience or over optimism.


That's interesting. I have successfully designed systems which could be easily migrated between different RDMS' with minimal changes. Obviously to go to no-sql or other less conventional storage is a different story.

It is an interesting question about how successful people are in migrating to different storage engines.


> I have successfully designed systems which could be easily migrated between different RDMS' with minimal changes.

Not that I doubt you, because I know with the right trade offs it's possible, but is the operative word "could" or did you actually ever migrate one of those architectures?

> Obviously to go to no-sql or other less conventional storage is a different story.

Ruling those out feels a bit disengenuous. There are plenty of good reasons to need to migrate to or from a relational DB.


Could you give some specifics as to what kind of stuff typically needs refactoring if you switch storage layers?


Even if you stick to the SQL standard as much as possible engine specific syntax almost always creeps in unless you routinely test against all target systems from the start (which you might not be able to justify the time to do on many projects where being platform agnostic is not initially a core high-priority requirement).

And even where there are no feature or syntax issues there may well be optimisation differences. For instance going from MSSQL to postgres you might hit a significant difference with CTEs because MSSQL can perform predicate optimisations through them but postgres doesn't - this might mean core queries need to be refactored significantly for performance reasons (to avoid extra index or table scans) if not functional ones. (not intending to pick on postgres here, I'm sure there are similar examples in the other direction and between other engines, but this is the most significant example that immediately springs to mind).


Stored Procedures, any use of the RDBMS SQL extensions (eg: BETWEEN)


I partly answered your question in a comment above. If you'd like me to go more in depth, happy too!


My thoughts exactly. I don't do a ton of DB programming, but I've only ever written one thing in a non-agnostic way. We got a requirement that users wanted to copy an entire "project" which was the top level of a hierarchy. I wrote an Oracle routine to do the deep copy. I did it because I imagined what the PL/SQL would look like (clean) vs. what the Java code would look like (considering Hello World in Java is ugly, you can see where I'm going ;-)


Mine too! Having logic embedded into SQL functions seems to be an anti-pattern to me (it's harder to maintain and harder to do release management). While it's great that Postgres can do this (btw, I love Postgres), I suspect there aren't many people will use this feature in production environment.


I used to think this, but have relaxed that view over time.

For constraints, validity checking, things like adding/updating timestamps, and other things that are about data integrity, about the only time I don't do that in the DB is when outside information is involved such that it can't be. Otherwise, to the extent possible, I want the datastore to only accept valid data. This goes to the notion that fixing code is easier than fixing data, so the store can not only defend itself, but also help catch bugs.

There are also times when dealing with huge amounts of data that doing whatever you're doing in an SP is the only way to get decent performance. Dragging enormous tables over the network to process is sometimes really wasteful. If you're tuning indexes and whatnot against that model, you're already changing the DB, and doing so at a level that is implicit rather than explicit, and in ways that can change out from under you (if the statistics change).


> it's harder to maintain and harder to do release management

How is this any different than anything else where DB data & application logic have to be kept in sync?

version control + loader/change scripts should handle it, no?


You should see what large Oracle shops in the enterprise segment do with PL/SQL, though whether they have gone too far is open for debate...


Yes they have. And then it's full on vendor lock-in.


Actually, I'm under the impression it's way more common to migrate language or framework than data storage. I externalize most of my business logic to postgres for that reason (plus, it's incredibly performant, especially since you don't keep requesting connections from connection pool for each single part of the computation). My backend app handles request sanitizing, providing endpoints, etc. All data processing is made in the database. Love it.


I agree but have you read BoiledCarrot from Martin Fowler? https://martinfowler.com/bliki/BoiledCarrot.html


> If ever you decide you want to migrate away from Postgres, you'll need to rewrite all the business logic into some other out of process language.

Of you follow this rule you'll never be able to use the most powerful features of postgres.

> If ever you want to scale this, such that you want to calculate states in batches, you'd want to have the business logic somewhere else...

Not due I'm totally on board with the premise here, but the need to calculate state outside the database is a generally valid one and so it might be a good idea to write the core of the transition function in JavaScript and implement the postgres functions in PLv8.


If you ever decide you want to migrate away from Postgres, it's because there's another platform that supports significant functionality that Postgres doesn't. Changing out the data tier is not a decision made on a whim.

If there is significant functionality we want to take advantage of in this new platform, then that means that we have to refactor anyway to use this, because by definition we either don't do it today, or we do it with an insufficient workaround in Postgres.

The migration necessitates large refactoring by its very justification.


This implies that a need to migrate away from Postgres will arise. For most companies it is very likely that such a need will never materialize. Meanwhile Postgres keeps getting better and better. https://wiki.postgresql.org/wiki/New_in_postgres_10


It's not just about migration. There are deployment issues as well.

I like to think of each layer in a tiered system as having differing deployment requirements and time tables. Front-end systems will be very frequent, backend systems possibly less so, though not necessarily, and then DBs ideally infrequent and they generally take longer.

At a minimum, keeping them decoupled is freeing for patching bugs and releasing features independently. It does raise the bar for keeping all changes compatible with existing systems.


There are good deployment tools for databases, e.g. ones for which all of the database 'code' objects (or just all of the objects period) are maintained in source control and updates are either automatic or scripted.

Of course updating a database is much harder than overwriting executable or library files, but a lot of the objects in a database should be safely updatable by simply dropping and recreating them.

The tricky changes are of course things like, e.g. splitting one column into two or merging two columns in two tables into one. But those changes are even harder to do the 'dumber' your database is, i.e. the less logic there is in it that enforces a certain level of quality in its data.


A great one is http://sqitch.org


Yeah, I've seen that before and it looks promising.

My favorite, and the only one I've used extensively, is [DB Ghost](http://www.dbghost.com/). What I like about it compared to all others I've run across is that it, by default, will automatically sync your target DB (e.g. a production DB) with a model source DB that it also builds automatically.

So instead of scripting out every schema change as explicit SQL statements and queries you just maintain the scripts to build the model source DB, e.g. to add a column to a table, instead of creating a SQL script file to `ALTER TABLE Foo ADD COLUMN Bar ...` you just update the existing SQL script file with the `CREATE TABLE Foo ...` statement. When you deploy changes – 'sync' a target DB in the DB Ghost terminology – it automatically detects differences and modifies the target to match the source.

The benefit being that neither you nor the DB Ghost program needs to explicitly perform every single migration since the beginning of time. Only changes that need to be explicitly handled as migrations, but, with a little customization, doing that is pretty easy too.

The bar now for me working with databases is whether I can create a new 'empty' database (with test data) in a minute or two and whether I can automatically deploy changes (or, as I'm doing now, generate a deployment script automatically). Given that, I can actually do something like TDD for databases, which is really nice, especially if there's significant business logic in the database (which there almost always is in my experience to-date).


If ever you decide you want to migrate away from Postgres, you'll need to rewrite all the business logic into some other out of process languag

You'll rewrite your front end 20 times in 20 different "frameworks", and your app 5 times in 5 different languages, for everytime you actually change databases


I've built inventory systems in a similar fashion.

On one hand, using a bunch of PL/pgSQL makes sense for transaction isolation and faster execution. I'm not sure how much of the arguments about business logic matter in this case, but a strong argument for using PL/pgSQL is that these queries written in Python (for example) are going to be significantly slower, and that really matters when there are 100 concurrent users hitting the database every 10 seconds. No one likes waiting for their system to update... they'll just switch over to Excel. I think that PL/pgSQL is a great use-case for situations where preoptimizing for speed isn't a mistake, though the trade off is that you need to find the rare expert on processing languages (what order to triggers fire and how do you prevent this from being an issue?), and who can work through the complex logic of the system.

I wonder why the author didn't use windowing instead the lateral example. You can window and sum over composite rows, and it goes pretty fast. I never did a direct comparison between lateral and windowing, but windowing would be much cleaner and not require you to generate a series.


Despite being pretty familiar with window functions, I couldn't figure out how to do it for that example in the time I was willing to spend on this post :). If you have a better query, I'm happy to update the article and give credit.


I love seeing SQL being "abused" in this manner.

It's almost as crazy as the Excel spreadsheet I found to simulate a neural network.

I've always loved seeing how far you could push SQL and databases to do things that (on the surface) would seem extremely difficult or impossible.

What it usually turns out to be is possible - but in the long-term unmaintainable. This FSM is not that - not yet (if it gained even more states, it could get there). But I have seen (and I have unfortunately written) queries in SQL that could make your hair stand up. Insane extreme monstrosities that I am both proud and ashamed of (fortunately, I don't work for that company any longer).

On a different but related note, I do recall one query that a friend of mine wrote to allow the querying of a zip code database (which had lat/lon columns for each zip code), to be able to calculate distances from a given address - using SQL. It was a simple form of geo lookup he had to do for a particular project. I later used the same code as a part of a lookup process for placing markers on a google map (markers would indicate whatever was needed for "show me locations near my address within N miles"). It was a pretty interesting piece of SQL for the reason of doing the specialized distance calculation (can't remember which calc, but one of the simplest ones that didn't take into account certain things about the earth's "roundness" to make things super-accurate - that wasn't needed for short distances and purpose).


Hah, I hear you :). Pushing this much logic into the DB can feel weird at times. And the application this is part of definitely has a lot of very complicated and large SQL queries.

That being said, I think things can be somewhat tamed by applying the same best practices to your SQL as you apply to your regular code: I.e. lots of tests, good comments and documentation, extracting logic into either set returning functions or functions suitable for lateral joins (both can be inlined if done right [1]), keeping things in version control, etc.

But yes, applying those practices can be harder in SQL than it is in your application layer language for various reasons. So I'll always recommend avoiding going too crazy. You can often get the best of both worlds by making pragmatic choices about what should be done at which layer. No need to enslave yourself to a false dichotomy.

[1]: https://wiki.postgresql.org/wiki/Inlining_of_SQL_functions


Cool trick. Should be an enum instead of text though. Also the transition table might be an actual SQL table as well instead of switch-cases in a function. - That way it remains a bit more declarative and you can do some meta-queries.


Yeah, maybe I'll update the post and mention ENUMs.

I like the type safety of ENUMs, and I'm actually using them in my application, but they can cause a lot of problems. E.g. it's currently almost impossible to remove an ENUM value [1], and adding a new value can't be done inside of a transaction.

The state transitions could definitely go into a table, but I feel it's overkill in many cases, and would have probably obfuscated the main ideas presented in the article.

Anyway, thanks for your feedback. There is definitely a lot more to explore for anybody who is interested in applying these ideas in practice :)

[1] https://www.postgresql.org/message-id/21012.1459434338%40sss...


Hm, I didn't know that about enums.

In the end it is just quibble about style anyway and one usually end up with what is measured faster in testing..


but if you were to push those states into a table, you'll likely end up with degraded performance due to the joins.

I agree with your comments about ENUMs, though, especially having built many "static" products that suddenly need their ENUMs changed, and having a group of developers visiting me and asking why it's not as possible as they had assumed :)


I disagree about enum. I've tried using it but I found that it's too hard to manage/migrate in postgres. Obviously there are a few different ways of achieving the enumish behaviour in postgres (or other dbs) — nowadays I just start with text with a constraint and upgrade to a real table if I need more detail.

YMMV but I don't think a blanket "just use enum" is the correct approach.


This is too hard to put into a sql file and run with your migrations tool?

    BEGIN;
        ALTER TYPE my_enum ADD VALUE 'bar' AFTER 'foo';
    COMMIT;


From the docs (as I suspect you already know): ALTER TYPE ... ADD VALUE (the form that adds a new value to an enum type) cannot be executed inside a transaction block

Also, removing a value is a pain (though, that may have changed now).

I know you can do it — but I've found after trying a bunch of approaches that text with constraints is a better starting point. Just wanted to point it out because it's something I researched a fair amount. Many guides say that enum can be a pain but I decided that I'd use them to be pure, then discovered they were a pain, and now don't use them as a starting point.


Yeah, this doesn't actually work. You'll get this error:

ERROR: ALTER TYPE ... ADD cannot run inside a transaction block


FWIW, this is fixed in the upcoming Postgres 10:

"Reduce locking required for adding values to enum types (Andrew Dunstan, Tom Lane)

Previously it was impossible to run ALTER TYPE ... ADD VALUE in a transaction block unless the enum type was created in the same block.

Now, only references to uncommitted enum values from other transactions are prohibited."

https://www.postgresql.org/docs/devel/static/release-10.html...


Sweet. That's certainly a nice improvement :). If removing ENUM values will be supported at some point as well, I can see myself recommending them unconditionally.


Would you be ok if dropping a value were to just mark it as 'deleted' from the catalogs, and no new values could be assigned? The reason enums have these weird restrictions is that they can appear in indexes, even after a value has been deleted (or its creation rolled back). If we don't know how to compare them after deletion, the index can't be correctly traversed anymore...


2 Questions! What's the index data structure that holds old enum values even after the value itself is not used anymore?

Secondly, what are your thoughts on using enum generally? Would you recommend them in cases where you don't need to really optimise for narrower columns?


> 2 Questions! What's the index data structure that holds old enum values even after the value itself is not used anymore?

It's not that index structure specific atm, even though it could possibly be avoided for some types (e.g. hash, although there's considerable visibility issues to make that possible). Consider e.g. a btree index, if the deleted enum value ends up in an inner page, you really need to know how it compares to know where to descend to.

> Secondly, what are your thoughts on using enum generally? Would you recommend them in cases where you don't need to really optimise for narrower columns?

I like using them. There's cases where the transactional restrictions make it too problematic, but other than that I think the documentational advantage is substantial, besides just the width.


Yeah, I think that'd be reasonable. My main motivation for using ENUMs is type safety and small column width. What you're proposing sounds like it would keep both of these advantages :)


> Also the transition table might be an actual SQL table as well instead of switch-cases in a function.

That would likely add an unnecessary read and slow things down just a little bit.


It depends. The table will reside in memory and it will just be a scan over a few bytes. Possibly cheaper than firing up the PL/pgSQL interpreter.


Since FSMs seem to make sense in some cases while implementing logic in both the server / database / client it would be interesting to create a language that would be a DSL which outputs transducers (Finite State Transducers) for each part. Using ideas from Functional Reactive Programming would be useful. Adding a schema for data within the transducers would be helpful. You would end up with something like react where logic would be shared and scaffolding could be created after making the server. This could possibly be augmented by CQRS and or Event Sourcing. I think functional programming would help (Clojure or F# would be my choices for implementation). This would also help with vendor lock in.


This was basically the background for http://github.com/jonnor/finito - still experimental.


Check out P?


For people interested in Prolog-like DB systems:

https://github.com/agentm/project-m36

> Unlike most database management systems (DBMS), Project:M36 is opinionated software which adheres strictly to the mathematics of the relational algebra. The purpose of this adherence is to prove that software which implements mathematically-sound design principles reaps benefits in the form of code clarity, consistency, performance, and future-proofing.


Author here: If the article above has you excited, come and join my team at Apple. We're hiring Go and PostgreSQL developers in Shanghai, China right now. Relocation is possible, just send me an e-mail to find out more about this role.

My e-mail is in my profile.


How is living in Shanghai?


Good question. I'm working remotely myself and can't relocate easily because my wife is a doctor and it's very difficult for her to move between medical systems. I was lucky to join this team when this wasn't considered a problem.

That being said, I'm in Shanghai frequently to sync with my colleagues and it's an amazing city. I've found pretty much most conveniences available to me in Berlin and to some extend even better. There is an active expat community, and a fascinating vibe.

My manager (from SF, USA) has been there for over a year now and has enjoyed the experience a lot and extended his stay. YMMV, but I'm fairly well traveled and would say that it's a pretty great spot if you're willing to emerge in a foreign language and culture.


This is cool however I think it may exhibit performance problems when used in the wild if the trigger constraint were to be used. (Could possibly gain some performance replacing a UNION with UNION ALL).

Also your colleagues may curse you down the line when they need to change the state machine's behaviour. Would of course be possible to keep note of a "state machine version number" on each row but you would end up with a transition function that had to know about every historic version of the state machine...


I have done very similar things in code rather in SQL. just a bit of curious, I understood it is not possible many to many connections, but why don't you allow from cancel->started again? The business logic is always tend to be changed. I think this is a bad example using FSM here. IMHO, I will only want to apply stable, constant and (code)internal FSM to pgSQL. (Thought about a joke how to kill a programmer just needs change the requirements three times. lol) Good article anyway.


Another interesting idea is to implement applications in SQLite3 (all local). IIRC libgda (GNOME GDA) uses SQLite3 to drive LDAP and other protocols.

This will get much easier to do now with the new "pointer passing interface" in SQLite3 3.20.0 (http://sqlite.org/bindptr.html).

I'm not recommend this, nor against this. Just observing that it's a legitimate design pattern.


".. come and join my team at Apple. We're hiring Go and PostgreSQL developers in Shanghai, China.."

Woot? Apple uses Go?


Yeah :).


Oh no... I commend your effort, but this is totally ill-conceived. What's even more disheartening is the number of people who looked at this and also thought it was a good idea.

There is just so much wrong here... let's start with the basics. If this is truly a FSM then why on earth are you using a transaction table (state is a value)? An accumulating snapshot table (state is a field), besides being EXACTLY for this purpose, will be far more efficient in almost every way. Your "analytic" queries would be simple select statements, and your trigger (shudder) could be reduced to simple constraints. The sheer amount of over-engineering here is staggering.

Lastly, what is the purpose of putting the transition logic in the database? It's simply redundant upon actual implementation. Somewhere, somehow, another program has to actually carry out your actions (paying/shipping/cancelling) on the order and make a call to your database with the appropriate event inputs. So why not just put the entire state machine with the rest of your business logic? As many have pointed out, this is where it should be anyway.


TBH I'm not sure if I follow your argument. You can certainly apply the main idea of this article (modeling a FSM as a user defined aggregate) to automatically materialize the latest state of each order. In fact, that's what we're doing in the app we're using this technique in.

Anyway, YMMV and I'm not recommending to apply the ideas in this article in every situation. But after having had all of this logic in the application layer before migrating to Postgres, I find this approach more maintainable.


What I'm getting at is that the author went ahead and implemented a bad solution to a simple problem because it seemed "cool". This is a great example of what NOT to do. A simple accumulating snapshot table with a bit field along with a timestamp for each state and some simple constraints would do the exact same thing more efficiently.

Furthermore, I'm challenging the utility of putting the transaction logic in the database. It doesn't make sense other than to seem elegant. His application needs to actually respond to the events so it's just silly to separate the transitions. This is a complete redo.


I don't understand your alternative solution, nor your negativity. Would you still keep the events table, or just have a single orders table with the latest state?


Reading back my own comments, I must apologize for my tone. I didn't mean to come off so negative. In fact, I do think your FSM is pretty cool.

So let me explain in more detail how you could solve this without all of the overhead you have created. Assuming a true FSM, there is no need to store the state as a value in your database (e.g. "awaiting_payment" in a [state] field). Instead, you should be storing your states as BIT fields:

[awaiting_payment], [awaiting_shipment], [shipped], etc

where "1" indicates the current state and every other state field is "0". In addition, you will need a nullable timestamp for each state:

[awaiting_payment_timestamp], [awaiting_shipment_timestamp], etc

to store when each state was entered, if at all. In this way, every order is in a single row and your analytics become trivial queries that don't require any complex joins or correlated sub-queries. Additionally, using simple constraints on each state field (e.g. the [awaiting_shipment] field <> 1 if the [awaiting_payment_timestamp] is NULL) would remove the need for your trigger and, again, reduce overhead.

This is called an "accumulating snapshot" table, and it is used in data warehousing to solve the exact problem you are dealing with here - tracking an entity through a series of predetermined events to monitor it's status and mark milestones. Using a "transaction" table like in your example, while certainly possible, is storing your data in a way that you cannot easily answer the questions you may want to ask of it. You are spreading your entity across multiple rows instead of putting it all in one row.

Think about how the data translates to an object in your domain. Let's call this table "order_state" ("order_events" is a bad name for two reasons: it's plural - which is just wrong - and it doesn't faithfully describe your entity). I'm going to assume "order" and "order_item" tables as well. These three tables create an aggregate root that can be used to derive an Order object in your domain. This Order will have, at the very least, 3 properties corresponding to your entity's attributes in your database:

id, items, state

where "items" is a collection of OrderItem and state is OrderState (see where I'm going here?). The question becomes about how the OrderState is populated. If your Order object represents the "context" in your FSM, then the OrderState is the base class for all states:

AwaitingPaymentState, ShippingState, CancelledState, etc

This means polymorphism is inevitable. There are 3 common ways to deal with polymorphism in a database, but using a transaction table (like in your example) is not one of them. For your problem, utilizing "single table inheritance" (where all objects in a hierarchy map to a single table row) makes the most sense because you are not storing different data in each child class, rather, just a marker to indicate which state is current (this is also generally the most performant solution). It also means you don't need to add any additional tables, because your "order_state" table can serve both purposes.

Using a transaction table, you are adding a 3rd orthogonal dimension to your aggregate (another set of primary keys), so your object graph becomes needlessly large and it becomes a chore to choose and hydrate your current state.

My last critique is in regard to the choice of putting your transition logic in the database itself. I just don't see how that can be useful other than to "seem" clean. What I mean is, somehow an order actually has to be shipped or an item has to be paid for, and adding a row to a database table doesn't do that. So somewhere else in your stack you must have duplicated your transaction logic in some fashion in order to facilitate the correct database calls when the events take place in your domain.

For example, using my above objects, when a the "pay" event is raised on your Order object "context", it will invoke the "pay" event on it's current state, AwaitingPaymentState, to undergo transition. In this scenario, there must be some method (an "action") that actually handles the payment. Either there is no state machine at all in your domain and you need to validate the state before processing the payment, OR you have a state machine like my example and you need to transition to the AwaitingShippingState. In either case, the logic is duplicated.

What you have created is a cool academic solution that may work (I don't doubt that it works), but this problem was solved 30 years ago and if this was done on my team, I would insist it be redone to adhere to the basic principals of the entity relationship model and database design.

Does this make more sense?


Hi, thanks for taking the time to explain your ideas.

There are certainly many ways to model this problem in the database. But to me the main idea of the article is implementing a FSM as a user defined aggregate. This is technique is actually not mutually exclusive with an "accumulating snapshot" table, and in fact the trigger could be modified to maintain such a table instead or in addition to the order_events table.

I don't quite understand why you insist on not having a transactions (order_events) table. IMO it's very helpful for auditing purposes, and can be used to capture additional data for each state transition. E.g. you could easily put an IP column on it. From what I can tell you'd have to add one column per event for doing this with your accumulated snapshot table. Our application has ~10 events, so I'm not sure I'd like to have an explosion of columns on it.

Also, I think your solution falls apart completely when it's possible to re-enter a previous state. (or would you overwrite the first timestamp?)

Last but not least, there are many advantages to putting logic into the db. I know you object to the advanced analytics because they can be done differently when you control the DB schema from day one. But imagine you inherit a database with an events table that can't easily be modified to use an accumulating snapshot table. How would you write analytical queries against it? Another advantage of keeping logic in the database is reducing the number of queries (and network overhead) that are needed.

Anyway, you've made many good suggestions, but I think you've taken the toy example a bit too serious. It's very hard to come up with good real world examples without disclosing details of your actual application (which I can't in this case).

Anyway, using FSMs backed by user defined aggregates is definitely not an academic solution. This powers a mission-critical application of one of the worlds big companies for a database with over a billion records. It's also not the first iteration, and previous implementations at the application layer had caused problems that FSMs inside the DB have solved. Last but not least, this modeling the problem domain as FSMs has allowed the entire team (including non-technical people) to deeply and correctly understand how our analytics work.


Given the problem in the article, an accumulating snapshot table is simply the best solution for the reasons I laid out. If your actual system has different requirements in terms of storage and analysis (which apparently it does), I would be happy to hear about those and come up with an updated recommendation.

Along those same lines I'm not insisting you can't have a transaction table. If you need to record more information, like an IP address or user_id with each transition, then it would require a transaction table (although the article mentions none of this).

Honestly, the storage concerns I bring up (that is the schema) are really far less of a concern than the basic idea of storing your transitions in the database. Put simply, you are storing the wrong information. Instead of storing the actual state, you are storing a series of inputs that allows you to derive the state. This is a HUGE mistake. What if you add a new state? Say, [in_transit]. Every one of those 1 billion rows that are already in there will no longer be able to be processed correctly because they won't have the correct order of events.

This entire thing could be one table that stores the [order_id] [state] [timestamp] [is_current], with a stored procedure to update it that takes an "order_id" and "event_name". Instead, you have created a house of cards.


Our actual system has the requirements of an existing DB schema that can't be changed, that's already using an events table, and that needs to be analyzed. I wrote some more about it here [1]. The FSMs are also cyclic, which can't be handled by your approach as I pointed out earlier.

Anyway, coming back to the example from the post, if I change the FSM (e.g. add a state/event) I need to make sure all existing data conforms to it one way or another (by making the change backwards compatible or by migrating the data). But the same problem exists for any other invariant you're trying to enforce/add to your DB (FKs, Checks, etc.). It's just the cost of enforcing an invariant.

I'm also confused at this point how you jump between saying it's okay to have a transactions table and then later backtrace and say it's a huge mistake because it's not storing the actual state. I already pointed out that the two are not mutually exclusive, and that the trigger could maintain an accumulated snapshot in addition to the order_events table. In fact, the trigger could event derive the new state based on the latest value from the accumulating snapshot table which may also address some of your concerns about adding new states.

Also, I haven't created a house of cards. I had the novel idea of for leveraging user defined aggregates as finite state machines and decided to share it with the PostgreSQL community. I couldn't use my companies business domain as an example, so I had to come up with a bogus one that people would understand easily. Now don't get me wrong, I'm totally enjoying the discussion around this bogus example, but I'm not enjoying your unwavering faith in the absolute correctness of your ideas and the HUGE mistakes you claim I'm making. So is is probably my last reply unless you're cool with toning it down a bit.

[1] http://disq.us/p/1l10by0


If you are bound to an existing database schema then there isn't much you can do (depending on how bound you are of course). That's a totally valid reason to choose your current path. I'm not naive to business requirements. Often, systems end up being far from theoretically perfect due to legacy {insert component of stack here}. I get it.

So ruling out a snapshot table (I think you have made clear this is not an option), your best choice would then be a transaction table where your [state] is a Type 2 Slowly Changing Dimension (SCD2) with an indicator field for the current state. I want to be clear here: you should be storing your state, NOT the events (arguments) that can be used to derive your state. I cannot express this enough. Your problem is not unique. You simply cannot be naive enough to think you are the first person to want store the state history of an FSM in a database. This is literally data warehousing 101. The novelty of your solution stems from the fact that you are (dangerously) misunderstanding the problem. If, for some reason, you need to store what [event_name] caused the transition to the [state], just add a field. Simple.

As for invariants. Let's use an example: say you add a "package" event that transitions to an [in_transit] state between [awaiting_shipment] and [shipped] (note here we may also then want to change "ship" to "deliver"). This simple and totally reasonable change would make every order with an event queue: "create", "pay", "ship" result in [error]. If you just stored the [state], then a [shipped] order will remain [shipped] regardless of what series of events brought it there - which would be correct. There would be no need to go back and change historical data - nor would you want to. How would that even be done? Manufacture shipping information and timestamps? There are no invariants in this process (again, I want to reiterate that this is a solved problem). True FSMs (if this is actually modeled as one) only care about the current state anyway and maybe the previous state (see SCD3). What I mean is, future actions are determined using the current state. If the current state becomes [error] for 1 billion orders, that's a problem. A huge problem.

"Absolute correctness" is an interesting term... and I agree that I am probably coming off a bit aggressive in my comments. Look, I'm not trying to be a bully here, and I DO think that the FSM you have created is interesting, unique, and a cool example of how user-defined aggregates can be used. The problem with it is, unfortunately, an extremely common one. If I had a nickle for every time a programmer jumped into the world of databases and came up with a "new" solution to a "never-before-seen" problem... I'd be a wealthy man. Relational databases have been around for a long time and are one of THE MOST studied and written about pieces of infrastructure/software in existence (they're interesting for so many reasons!). What this means is that it's EXTREMELY unlikely you have found a problem that hasn't already been "solved" (of course there can be wiggle room) in a theoretically correct manner. It's almost always just an individual not recognizing (or breaking down) the problem efficiently.

And that's what I am seeing here: a problem that is present and handled in LOTS of databases (I've personally done this dozens of times), but is being approached from a less-than-perfect (albeit unique) angle. I can't say that I'm "absolutely correct"... I don't have the full grasp of your domain and business requirements. I don't know precisely the degree of constraint your schema is imposing. I don't know how well communication between teams and management of your stack is handled (yes, these can be important - you did mention you had problems in these areas). Maybe yours really is the best solution - though I remain unconvinced. But I can say, that given what I know about the problem, the entity relational model, database design, and data warehousing that your solution is simply not sound. It's difficult to put that nicely. I don't mean that your wrong (apparently this is working for you). It's not black and white. Just that if one of my clients came to me with similar requirements, I would follow "the book", because this has already been written.

Here's what I'd do (SCD6):

    Order [OrderId] ... [additional data]

    OrderStateHistory [OrderStateId] [OrderId] [TransitionTimestamp] [PreviousStateName] [EventName] [StateName] [IsCurrent] ... [additional data] 
And if you want a transitions table for your FSM to enforce explicit constraints (I don't recommend this, but it's MUCH better than hiding this data in a function/trigger):

    OrderTransition [CurrentStateName] [EventName] [NextStateName]
    
You can figure out the keying/substitutions. Then use a simple stored procedure:

order_state_history_tr @current_state_name, @event_name -- note this the PK of OrderTransition

to facilitate transitions. This stores BOTH the transitions and the state.


Relevant article detailing how MSSQL is implementing history tracking infrastructure out of the box for the 2016 release (SCD4 for obvious reasons):

https://docs.microsoft.com/en-us/sql/relational-databases/ta...


I'd like to hear more about how you would use constraints in Postgres to control state transitions in the accumulating snapshot table.


Off the top of my head there a number ways to achieve this: The first would be to use CHECK constraints on your state fields to ensure that the appropriate states have timestamps to ensure they have been reached. Other logic could be included as well if necessary.

A second option would be to have a transaction table to record each transition (and a reference table containing all valid states), and then have foreign keys in the accumulating snapshot table to the transaction table. Although this would be kind of silly for the current problem due to its simplicity.

A third option would be to use an [is_valid] calculated field to record whether or not the state is valid.

I'm sure I could come up with more, but it's all a silly exercise because I would do (and recommend) none of these and keep business logic where business logic belongs. This would allow for more flexibility for different "kinds" of orders, more complex validation, etc.


I do this sort of thing, and I highly recommend it, especially with PG. Another thing I like to do is to use queries against the pg_* tables to generate code/metadata for components of the application running outside SQL.


Another useful thing to do is to take advantage of record types. PG SQL is approaching something that one might call higher-order SQL.

When you can query the SQL schema using SQL queries (though I admit that the pg_* tables kinda suck) and when you have things like record types, you really do have a very powerful SQL.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: