
I had no idea one could get a consistent read view across multiple queries within a transaction using most SQL databases. That does poke a hole in a major benefit that I thought was unique to Datomic; great to know!

However, I do think that setting up a SQL database to be able to query against any previous view of the world based on a transaction id, as Datomic allows, wouldn't be as "simple" as you make it out to be.




Datomic allows one to get a consistent basis for multiple queries, separated by arbitrary amounts of time, outside of any transactions. Having to group queries motivated and conducted by different parts of your system into a single transaction in order to get a consistent basis is a source of coupling.


Given that I need to share that common basis among the different parts of my system that need the consistent view, I am already taking the hit on that coupling: I am going to need to make certain that all of those parts all have access to the shared basis, whether it be a timestamp, a transaction identifier, or an active connection.

If the concern is coupling across space, you can store the transaction on the server in most existing databases. Past that, I would be highly interested in knowing the motivating use case that is causing the need for this much global and distributed snapshot isolation, but otherwise agree: I'd love to see more databases actually expose the ability to more easily "query the past" as well as request guarantees on vacuum avoidance.


I would characterize the level of coupling involved in sharing "the basis is 12345" as categorically different from having to nest database access within the same transaction. Consider, e.g., the ease of moving the former to different processes.


Agreed, but this is due to MVCC databases not saving the snapshots of committed transactions. If you have an active transaction in PostgreSQL you can start a new transaction using the same snapshot from another process. (This feature is exposed in SQL in version 9.2, and was motivated by the plan to implement parallel pg_dump.)

EDIT: You can find documentation of this at http://www.postgresql.org/docs/9.2/static/sql-set-transactio...

If PostgreSQL saved all historical snapshots and vacuum were disabled, you could communicate any point in time simply with a snapshot number. This probably won't be very efficient, though, since PostgreSQL is not optimized for this kind of use.
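Roughly, that snapshot sharing looks like this (a sketch assuming PostgreSQL 9.2+; the snapshot identifier shown is made up, and the exporting transaction has to stay open while other sessions import its snapshot):

Connection 1:

    demo=> begin transaction isolation level repeatable read;
    BEGIN
    demo=> select pg_export_snapshot();
    000003A1-1
Connection 2:

    demo=> begin transaction isolation level repeatable read;
    BEGIN
    demo=> set transaction snapshot '000003A1-1';
    SET
From this point on, both connections see exactly the same committed data until their transactions end.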


This feature is very interesting! However, it is definitely not designed for this purpose, and is thereby fairly heavyweight: it is writing a file to disk with information about the snapshot instead of just returning it to the client, and the result is the filename.

Not knowing this existed, I spent the last hour implementing this feature in a way that just requires getting a single integer out (the txid) and then restoring it as the snapshot being viewed (but not changing the txid of the running transaction, which solves a lot of the "omg what would that mean" problems).

http://news.ycombinator.com/item?id=4448767

With this implementation, you can just use txid_current() to save a transaction snapshot and you can restore it using my new variable (which correctly installs this functionality only into the currently executing transaction's snapshot). (In a more ideal world, I'd re-parse the string from txid_current_snapshot().)

I also didn't solve the vacuum problems, but I think there might be some reasonable ways to do that. Regardless, I imagine that most of the interesting usages of this are not "restore a snapshot from three days ago" but more "share a snapshot between multiple processes and machines for a few seconds".

However, if you can get even one of those processes to hold open a transaction, you can copy its snapshot to other transactions in other sessions, and then even this naive implementation should be guaranteed to have access to the old data.

If you are using it only for these short periods of time, for purposes of "distributed and decoupled consistency", we also don't need to worry about the overhead of never running a vacuum: the vacuum process runs as normal, as we are only holding on to data very temporarily.

That said, in practice, you really can go for quite a while without running a vacuum on your database without issues, and I imagine any alternative system is going to run into similar problems anyway (log structured merge trees, for example, get screwed on this after a while, as the bloom filters become less selective).

You can't, however, make this work "forever" with PostgreSQL (if you wanted to store data going back to the beginning of time) due to 32-bit txid wraparound problems :(. (This also should affect my silly implementation, as I should save the epoch in addition to the txid.)


To get a consistent view across multiple queries you just use the SERIALIZABLE isolation level. In PostgreSQL, REPEATABLE READ also works, but the standard does not guarantee this (the standard allows phantom reads, since it assumes you use locking rather than MVCC snapshots to implement REPEATABLE READ).
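A minimal sketch of what that looks like in practice (the table t and its contents are hypothetical): every query inside the transaction sees the same committed state, no matter what other sessions commit in between.

    demo=> begin transaction isolation level repeatable read;
    BEGIN
    demo=> select count(*) from t;
    42
    demo=> -- other sessions may insert into t and commit here; we will not see it
    demo=> select count(*) from t;
    42
    demo=> commit;
    COMMIT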

The ability to query any historical view of the data is indeed not there in PostgreSQL in any simple or reliable way. That is an advantage of Datomic, but I do not see why it would be impossible to implement in a "traditional database".


The reason I claim this would be simple is that PostgreSQL is almost already doing this. The way the data is stored on disk, every row has two transaction identifiers, xmin and xmax, which represent the transaction in which that row was inserted and the transaction in which that row was deleted; rows, meanwhile, are never updated in place, so the old data stays around until it is removed by a vacuum.

To demonstrate more tangibly how this works, I just connected to my database server (running PostgreSQL 9.1), created a table and added a row. I did so inside of a transaction, and printed the transaction identifier. I then queried the data in the table from a new transaction, showing that the xmin is set to the identifier of the transaction that added the row.

Connection 1:

    demo=> create table q (data int);
    CREATE TABLE
    demo=> begin; select txid_current();
    BEGIN
    189028
    demo=> insert into q (data) values (0); commit;
    INSERT 0 1
    COMMIT
    demo=> begin; select xmin, xmax, data from q;
    BEGIN
    189028|0|0
Now, while this new transaction is still open, from a second connection I'm going to create a new transaction in which I update this row, changing the value it stores from 0 to 1, and then commit. In the first connection, as we are still in a "snapshot" (I put this term in quotes, as MVCC is obviously not copying the entire database when a transaction begins) from a transaction started before that update, we will not see the update happen, but the hidden xmax column (which stores the transaction in which the row was deleted) will be updated.

Connection 2:

    demo=> begin; select txid_current();
    BEGIN
    189029
    demo=> update q set data = 1; commit;
    UPDATE 1
    COMMIT
    demo=> select xmin, xmax, data from q;
    189029|0|1
Connection 1:

    demo=> select xmin, xmax, data from q;
    189028|189029|0
As you can see, the data that the other transaction was referencing has not been destroyed: the old row (the one with the value 0) is still there, but the xmax column has been updated to indicate that this row no longer exists for transactions that began after 189029 committed. However, at the same time, the new row (with the value 1) also exists, with an xmin of 189029: transactions that begin after 189029 committed will see that row instead. No data was destroyed, and this data is persisted this way to disk (it isn't just stored in memory).

My contention then is that it should be a fairly simple matter to take a transaction and backdate when it began. As far as I know, there is no reason that this would cause any serious problems as long as a) it was done before the transaction updated or inserted any data, b) there have been no vacuums during the backdated period, c) HOT (heap-only tuple) updates are disabled (in essence, this is an optimization designed to do online vacuuming), and maybe d) the new transaction is read only (although I am fairly confident this would not be a requirement).

For a more complete implementation, one would then want to be able to build transactions (probably read-only ones; I imagine this would cause serious problems if used from a writable transaction, and that really isn't required) that "saw all data as if all data in the database was alive", which I also believe would be a pretty simple hack: you just take the code that filters dead rows from being visible based on these comparisons and add a transaction feature that lets you turn them off. You could then use the already-implemented xmin and xmax columns to do your historical lookups.
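To make that concrete, here is a very rough sketch of such a historical lookup done purely in SQL against the real xmin/xmax columns, using the basis 189028 from the demo above (the xid-to-bigint cast has to go through text, and this ignores commit status, in-progress transactions, and xid wraparound, all of which a real server-side implementation would have to handle):

    demo=> select data from q
    demo->   where xmin::text::bigint <= 189028
    demo->     and (xmax::text::bigint = 0 or xmax::text::bigint > 189028);
    0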

P.S. BTW, if you want to try that demo at home, to get that behavior you need to use the "repeatable read" isolation level, which uses the start of the transaction as the boundary as opposed to the start of the query. This is not the default; you might then wonder if it is because it is expensive and requires a lot more coordination, and as far as I know the answer is "no". In both cases, all of the data is stored and is tagged with the transaction identifiers: the difference is only in what is considered the reference time to use for "which of the rows is alive".

However, it does mean that a transaction attempting to update a value that has been changed by another transaction will fail, even if the updating transaction had not previously read the state of that value. Most reasonable usages of a database actually work fine with the relaxed semantics that "data truly committed before the query executes" provides (that still wouldn't allow data you update to be concurrently and conflictingly updated by someone else: their update would block), and those semantics are not subject to "this transaction is impossible" errors.

Both Connections (setup):

    demo=> set session characteristics as transaction isolation level repeatable read;
    SET
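To illustrate that failure mode (a sketch, continuing from the table above; the exact error text is from memory and may vary by version): with both connections at repeatable read, connection 1 starts a transaction and reads, connection 2 updates the row and commits, and connection 1 then tries to update the same row.

Connection 1:

    demo=> begin; select data from q;
    BEGIN
    1
Connection 2:

    demo=> begin; update q set data = 2; commit;
    BEGIN
    UPDATE 1
    COMMIT
Connection 1:

    demo=> update q set data = 3;
    ERROR:  could not serialize access due to concurrent update
    demo=> rollback;
    ROLLBACK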


Thanks for taking the time to elaborate; very interesting. I wonder if the SQL db vendors or open source projects will take the next step to make querying against a transaction ID possible, given that the underlying implementation details bring it pretty close.

I also see Rich has made some interesting points elsewhere in this thread about consistent views being available outside of transactions and without the need for coordination (within Datomic); it seems more appropriate to comment directly there, though.

Overall I think it's important to understand these nuances and not view Datomic as some revolutionary leap, even if I am excited about the project. I appreciate your insight into the power already within SQL db engines.


I am not certain whether your response comes from a reading of my comment before or after the paragraph I added that started with "for a more complete implementation", but if it was from before I encourage you to read that section: the ability to do the query is pretty much already there due to the xmin/xmax fields that PostgreSQL is already reifying.


(edit: Apparently, nearly an hour ago, jeltz pointed out that PostgreSQL 9.2 actually has implemented nearly this identical functionality through the usage of exported snapshots, so I recommend people go read that comment and the linked documentation. However, my comment is still an example of the functionality working.)

(edit: Ah, but the feature as implemented actually saves a file to disk and thereby has a lot of server-side state: the way I've gone ahead and implemented it does not have this complexity; I simply take a single integer and store nothing on the server.)

http://news.ycombinator.com/item?id=4448472

> I wonder if the SQL db vendors or open source projects will take the next step to make querying against a transaction ID possible, given that the underlying implementation details bring it pretty close.

For the hell of it, I just went ahead and implemented the "backdate a transaction" feature; I didn't solve the vacuum guarantees problem, however: I only made it so that a transaction can be backdated to another point in time.

To demonstrate, I will start with a very similar sequence of events to before. However, I am going to instead use txid_current_snapshot(), which returns the range (and an exception set that will be unused for this example) of transaction identifiers that are valid.

Connection 1:

    demo=# create table q (data int);
    CREATE TABLE
    demo=# begin; select txid_current_snapshot();
    BEGIN
    710:710:
    demo=# insert into q (data) values (0); commit;
    INSERT 0 1
    COMMIT
    demo=# begin; select txid_current_snapshot();
    BEGIN
    711:711:
    demo=# select xmin, xmax, data from q;
    710|0|0
Connection 2:

    demo=# begin; select txid_current_snapshot();
    BEGIN
    711:711:
    demo=# update q set data = 1; commit;
    UPDATE 1
    COMMIT
    demo=# select xmin, xmax, data from q;
    711|0|1
Connection 1:

    demo=# select xmin, xmax, data from q;
    710|711|0
    demo=# begin; select txid_current_snapshot();
    BEGIN
    712:712:
    demo=# select xmin, xmax, data from q;
    711|0|1
So far, this is the same scenario as before: I have two connections that are seeing different visibility to the same data, based on these snapshots. Now, however, I'd like to "go back in time": I want our first connection to be able to use the same basis for its consistency that we were using in the previous transaction.

Connection 1:

    demo=# set snapshot_txid = 711;
    SET
    demo=# select txid_current_snapshot();
    711:711:
    demo=# select xmin, xmax, data from q;
    710|711|0
This new variable, snapshot_txid, is something I created: it gets the current transaction's active snapshot and modifies it to be a range from that transaction id to that same id (I think a better version of this would take the exact same string value that is returned by txid_current_snapshot()).

From that previous basis, the row with the value 0 is visible, not the row with the value 1. I can, of course, go back to the future snapshot if I wish, in order to view the new row. (I am not yet certain what this will do to things writing to the database; this might actually be sufficient, but I feel like I might need to either mess with more things or mark the transaction read-only.)

Connection 1:

    demo=# set snapshot_txid = 712;
    SET
    demo=# select txid_current_snapshot();
    712:712:
    demo=# select xmin, xmax, data from q;
    711|0|1


A couple questions about your example:

I noticed that the xmin changed from Connection 1 from 189018 -> 189028. Is that just a typo?

Is the concept of a transaction in this regard a 'state of the entire db'? If a transaction included multiple modifications, would they all get the same xmax? If so, I see this as a difference between the presentation and your example. The transaction is a modification to the entirety of the db and is a state of the db. In Hickey's presentation, he very clearly says that the expectation is that the transaction component of the datom is specific to an individual datom.

Since it's been a while since I've worked on DBs, and even then I didn't know much, your demo has helped put it in perspective.


Thank you so much for noticing that error. What happened is that I started doing it, and then made a mistake; I then redid the entire flow, but forgot I needed to re-copy/paste the first half as the transaction identifier would have changed. I have updated my comment to fix this.

Yes: the transaction is the state of the entire database. However, you can make your transactions as fine-grained as you wish; the reason not to do so is that you are likely to end up with scenarios where you want to atomically roll back many changes that other people using the database should not see in the case of a failure. You certainly then will at least want the ability to make multiple changes at once.

The tradeoff in doing so, however, is that you will need to make a new transaction, which will have a different basis. I agree: if you then have a need to make tiny changes to individual items one at a time that need to share a consistency basis and yet have no need to be atomic with respect to each other (which is the part that would surprise me), then yes: you need the history query function to implement this.

I would be fascinated by a better understanding of that use case. Does he go into an explicit example of why that would be required later in the talk? (If so, I could try to translate that use case as best I can into a "traditional database" to see whether you really need that feature; if you do, it might be valuable to try to get something related to this design into PostgreSQL: I am starting to get a better understanding of the corner cases as I think more about it, and think I can come up with a proposal that wouldn't make this sound insane to the developers.)



