Rich Hickey on Datomic, CAP and ACID (infoq.com)
113 points by sethev on Jan 21, 2013 | 26 comments



There's a bit of a barrier to entry for Datomic because it really doesn't look quite like anything that you're familiar with if you're just used to SQL and NoSQL. The basic starting point is one big many-to-many relation, what in SQL might be one big table of (db_id, subject_id, verb_id, object_id, timestamp) of these "facts" which are (subject verb object) tuples, and these queries can be limited by the transaction timestamp. I'd say more about it but honestly, because it's closed-source, I've had real trouble figuring out what exactly is going on in the backend; a lot of the organization is done with namespaced keywords so I am not sure how exactly the datatypes are organized, but you can query based on subject, verb, or object quite rapidly. As Rich Hickey says, one instrumental thing is that these tuples are only inserted -- they are never updated or deleted. This enables a "many readers, one updater" architecture without violating consistency when you're updating the data: nobody uses a timestamp greater than their recorded one until the model (in the MVC sense) tells that view that it's ready to update. There is a mechanism for naming new elements in your transactions so that you can insert a bunch of facts at once.
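
To make that concrete, here's a toy model in plain Clojure of those (subject verb object) facts plus a transaction stamp (this is an illustration, not Datomic's actual internals):

    ;; each fact is an (entity attribute value transaction) tuple
    (def facts
      [[1 :person/name  "Fred"             1000]
       [1 :person/email "fred@example.com" 1000]
       [1 :person/email "fred@new.com"     1005]])

    ;; because facts are only ever added, a read "as of" a
    ;; transaction is just a filter on the tx component
    (defn as-of [facts tx]
      (filter (fn [[_ _ _ t]] (<= t tx)) facts))

    (as-of facts 1000)
    ;; => ([1 :person/name "Fred" 1000] [1 :person/email "fred@example.com" 1000])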

It's also a little difficult because it is substantially meta -- instead of a special syntax ("CREATE TABLE" etc.) for handling structure, the information about how the database is organized is stored in the same "fact" tuples that make up the rest of the database, just with some automatic verbs that come bundled in. I'm always a fan of self-expressing systems, but that might throw some newcomers for a loop.
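
For example, defining a new attribute is itself just a transaction of facts using built-in idents like :db/ident and :db/valueType (this follows Datomic's documented schema idiom; :person/name is a made-up attribute):

    [{:db/id #db/id[:db.part/db]
      :db/ident :person/name
      :db/valueType :db.type/string
      :db/cardinality :db.cardinality/one
      :db.install/_attribute :db.part/db}]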

I guess my main comment about Datomic is, "I wish there were an open-source version out there." I'm interested in seeing what the plumbing of such a system looks like and perhaps learning something from it -- and I'd like to peek at their Datalog engine and so forth. Unfortunately, just from peeking inside the JAR, it appears that one might need a good understanding of Google Guava, Apache Ant, the Jetty server, and a bunch of other things to be able to read the source, which might make digging into the code prohibitive. I'm more interested in learning from the system than I am in using it.


I share your interest in the Datalog engine implementation. The free version of Datomic does include Datomic Datalog and you can certainly play with it against the normal Clojure data structures without bothering with the backend.
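
For instance, the same q function runs against an ordinary vector of tuples (the attribute names here are invented):

    (require '[datomic.api :as d]) ; the free peer library

    (d/q '[:find ?name ?age
           :where [?e :name ?name]
                  [?e :age ?age]]
         [[1 :name "Alice"] [1 :age 30]
          [2 :name "Bob"]   [2 :age 25]])
    ;; => #{["Alice" 30] ["Bob" 25]}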

You might be interested in Cascalog, which is a similar, datalog-style query engine designed for use on Hadoop. But it can be run locally with a play data space. The implementation is open source and you can peruse it nicely.

fogus has been doing some work on bacwn:

https://github.com/fogus/bacwn

which may also hold interest for you.

Sooner or later I'm going to get back to the prolog-y, datalog-y chapter of ANSI Common Lisp as well.


The description you have for datomic is pretty much spot on for how graph databases work, with schema and data both stored as data, subject verb object tuples, etc.

It's sad that this is a new concept for so many people in the software industry, considering that graphs are one of the core data structures taught in CS.


Like most technologies, Datomic builds on ideas and concepts that already exist, and would find it hard to set itself apart by simply latching on to a single idea or concept. So if it were "just another" graph database and nothing more, I'd agree with you.

However, Datomic puts together a whole host of other concepts - data immutability and all its resulting benefits; separation of query, storage, write, coordination (with carefully considered trade-offs); Datalog-based querying in place of SQL; in-app access to data with a great solution for the string concatenation hell that we've all gone through trying to juggle our SQL, and the list goes on.
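
On the string-concatenation point: because Datalog queries are plain Clojure data structures, composing them is ordinary data manipulation rather than string surgery (the attributes below are made up):

    (def base-query
      '[:find ?order :where [?order :order/status :shipped]])

    ;; adding a condition is a conj, not concatenation
    (conj base-query '[?order :order/customer 42])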

I don't think any of these individually make Datomic special, but collectively? IMHO, outstanding!

Edit: For those interested, we have a new Datomic Community on G+ where we have collected a number of links to Datomic resources and videos: https://plus.google.com/communities/109115177403359845949


datomic claims to solve the O/R impedance mismatch -- the "vietnam of computer science"[0] -- which is why datomic matters to application developers.

"... The facts the Storage Service returns never change, so Peers do extensive caching ... Once a Peer's working set is cached, there is little or no network traffic for reads ..." [1, paragraph 4]

this basically means that your reads are VERY fast -- in-memory fast -- which means you can program as if your data wasn't in a database.

[0] http://www.codinghorror.com/blog/2006/06/object-relational-m...

[1] http://docs.datomic.com/architecture.html

ps video is only 14 minutes long, worth a watch


From http://www.datomic.com/:

    Datomic is a database of flexible, time-based facts, supporting queries and
    joins, with elastic scalability, and ACID transactions. 

    Datomic can leverage highly-available, distributed storage services, and puts
    declarative power into the hands of application developers.


the more i see people speak about databases this way, the more i learn to appreciate chris date

almost all the explanations i've seen of datomic explain the software architecture or components (see this for example: http://www.datomic.com/overview.html)

the more important aspect of a db is the conceptual data model, the abstraction it offers, the one which you as a developer will use to describe, create and query data and information

the relational model is not about the components of a DBMS, it's not about indexes and optimizers, it's about describing your data as relations, using relational operators, and above all about data integrity

if it is not an implementation of the relational model, what is the model it's trying to implement, and why should we use this data model?


I think partly this is because Clojure programmers already understand the ideas behind Datomic's data model, since it is somewhat of a natural extension of Clojure's model of state.

Rich has given talks on the architecture and talks on the data model, and at the Conj Stuart gave a talk on the testing infrastructure. You might poke around for a different talk if you haven't found one that touches on it yet.


I feel the same way about Datomic too. I really want to be excited about Datomic, but I just can't. Rich Hickey is a smart dude, no doubt, but I just don't see anything amazing here with Datomic. Every time I try to learn about what kinds of problems it solves, all I can find are descriptions of how differently it works under the hood and how different it is from the perspective of the programmer.


"the more important aspect of a db is ..."

No. To you it's that.

To me the ability to easily recreate the state -- any state the DB was in at any time -- is much more important.

A big part of my job consists of asking a DBA to dump the production DB to PREPROD or DEV environments so that I can try to recreate the state at which the shit hit the fan. And it's more than painful. And it's not my fault: I'm inheriting apps that I have to maintain/bugfix/enhance. And it's hell. Mutability hell.

One day CRUD DBs will be regarded as dumbly as languages in which any variable can be globally modified, from any thread. We'll look back at these CRUD DBs and wonder how stupid we've been not to listen earlier to the ones advocating a saner world.

You really should listen to several of Rich Hickey's talks because his ideas are more than sound and come from a lot of immense Real-World [TM] pain and suffering.

All the time he says: "Have you ever ...? Not fun!" he's 100% spot on. His language and DB are the most pragmatic things ever.

It's amazing.


Conceptually this is not so different than a regular database and its log. The log is the immutable, write-only part, and the tables are the current representation. Datomic moves the current representation further up the stack.
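
A toy sketch of that idea: the log is an immutable sequence of assertions, and the "table" is just a fold over it (names invented):

    (def log
      [[:assert 1 :email "a@example.com"]
       [:assert 1 :email "b@example.com"]])

    ;; the current representation is derived from the log, not primary
    (defn current-state [log]
      (reduce (fn [state [_ e attr v]]
                (assoc-in state [e attr] v))
              {} log))

    (current-state log) ;; => {1 {:email "b@example.com"}}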


You can implement an immutable database system with SQL databases today. For instance, just design your application to only add new records, and not to issue any updates.

Why is an entirely new database system needed?
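
For illustration, here's roughly what that append-only pattern looks like with clojure.java.jdbc (table and column names are made up):

    (require '[clojure.java.jdbc :as jdbc])

    ;; never UPDATE or DELETE; each change is a new timestamped row
    (defn set-email! [db user-id email]
      (jdbc/insert! db :user_facts
                    {:user_id user-id
                     :email   email
                     :tx_time (java.sql.Timestamp. (System/currentTimeMillis))}))

    ;; the current value is simply the most recent fact
    (defn current-email [db user-id]
      (:email (first (jdbc/query db
                                 ["select email from user_facts where user_id = ? order by tx_time desc limit 1"
                                  user-id]))))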


Datomic does a lot more than just avoiding updates/deletes. It splits the database into multiple parts. The transactor doesn't do any data storage. The clients handle all their own queries locally. This means that the transactor (the only bottleneck in the system) does not have to handle any queries or storage and can spend 100% of its time handling write transactions. All other components of the system are trivially cacheable and scalable.

The other big advantage is the concept of "the database as a value". As a client, you can easily obtain the database as an immutable value right inside your application. This allows you to do all the queries you want without affecting anyone else.
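
In code it looks something like this (the mem URI and the :user/name attribute are invented, and the attribute would need to exist in the schema; d/connect, d/db, and d/q are the documented peer API):

    (require '[datomic.api :as d])

    (d/create-database "datomic:mem://example")
    (def conn (d/connect "datomic:mem://example"))

    ;; `db` is an immutable snapshot -- the database as a value
    (def db (d/db conn))

    ;; every query against `db` sees exactly the same data,
    ;; no matter what anyone else transacts afterwards
    (d/q '[:find ?e :where [?e :user/name "alice"]] db)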


You can already rewind a database to an arbitrary point in time right off the transaction log. The feature exists in the technology; the vendors just haven't productized it. What does that tell you? That the demand for your use case is commercially insignificant.


Or it's more difficult to apply in practice than you believe.

Being able to create a duplicate of the database at a past date for debugging purposes would have been extremely useful in my previous job.


I have immense respect for and look up to anybody who can do something as complex and low level as DB kernel development (mem management, persistence/file system, caching), especially with the long list of features you have to support to make the DB desirable/useful (ACID transactions, connectors).

I also hope DB implementers know how to get acquired by big companies, because I don't see how they can really compete with these big legacy companies that have been developing their core DBs for decades and have an army of support/sales to back them up. Not to mention, where's the safety and accountability for mission-critical data with a new product versus, say, DB2 or Oracle? Maybe the new DB is a better implementation/more fault tolerant, but there's no big corporation to blame if something goes wrong, rather a small company you gambled on.


One thing I haven't heard discussed about the "save data forever" model is that in some scenarios you WANT to purge old data and not have it recoverable, for legal reasons.

I guess Datomic is not suited to these situations?


This seems to be a frequently asked question. From what I've seen on the videos and forums, the Datomic team is also well aware of situations in which this may be legally necessary and are working on providing ways to handle this.


I believe you can set some fields to not store history

https://groups.google.com/forum/#!msg/datomic/WlRM3aXwzIg/RT...
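
That's the :db/noHistory flag on the attribute's schema, roughly like this (:session/token is a made-up attribute):

    {:db/id #db/id[:db.part/db]
     :db/ident :session/token
     :db/valueType :db.type/string
     :db/cardinality :db.cardinality/one
     :db/noHistory true               ; past values are not retained
     :db.install/_attribute :db.part/db}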


I think most people don't know that Datomic can be seen as a layer between your code/data and nearly any backend (including SQL: for example you can use PostgreSQL "behind" Datomic).

"CRA" (Create Read Append) is the future for 99% of all the application out there IMNSHO. I know most people don't believe it but you're simply not coming back once you understand how easy it is to "recreate the state" using such a tool. In addition to CRA, the fact that queries are forever cacheable is huge. The (lifelong) problem of "cache invalidation" simply doesn't exist anymore.

I'm not saying that Clojure is going to replace Java/C# or that Datomic is going to be used everywhere. But the concepts at the heart of these (like, say, lockless concurrency in Clojure and the "grow-only" property of Datomic) are what the future's gonna be made of.


I mostly agree with your sentiment, but,

> I think most people don't know that Datomic can be seen as a layer between your code/data and nearly any backend (including SQL: for example you can use PostgreSQL "behind" Datomic).

While this is true, it doesn't matter. That Datomic is backed by Postgres doesn't matter to me unless I can query it with all the same functionality as I do Postgres. Which I can't. It's just a dumb store (like all the backends), so this is a meaningless feature except to people running operations.


> "CRA" (Create Read Append) is the future for 99% of all the application out there IMNSHO

How about counters, e.g. the number of upvotes?

They're typical upsert operations, but sorting needs powerful aggregation.


Nathan Marz has been promoting this as the Lambda Architecture, has a couple presentations/blog posts about the idea and is writing a book. While other people probably have similar ideas, I'm unaware of anybody else attempting to teach them in a similarly cohesive manner. I don't know what I'm doing but I'm working my way through understanding this area so I'll work through the queries as an exercise.

For your counter, you'd record the individual votes as tuples of (timestamp, voter, vote, subject). These get dumped into a distributed data store. From there, they get batch processed into the equivalent of a database view. The cascalog query for your 'posts' view would be something like:

    (defn post-view [post-source vote-source output]
      (?<- output [?timestamp ?postid ?upvotes ?downvotes ?title ?body]
           (post-source ?timestamp ?postid ?title ?body)
           (vote-source _ _ ?vote ?postid)
           (> ?vote 0 :> !up)    (c/!count !up :> ?upvotes)
           (< ?vote 0 :> !down)  (c/!count !down :> ?downvotes)))
Produces a set of (timestamp, id, upvotes, downvotes, title, body) tuples. You can then do another pass on it to run your sort function:

    (defn post-score [now-ms time-ms upvotes downvotes]
      ;; net votes, decayed by minutes since posting
      (let [mins-since (/ (- now-ms time-ms) (* 60 1000))]
        (/ (- upvotes downvotes) (+ mins-since 1))))

    (defn post-rank [post-view-source output]
      (?<- output [?postid ?votes ?title]
           (post-view-source ?timestamp ?postid ?upvotes ?downvotes ?title _)
           (- ?upvotes ?downvotes :> ?votes)
           (post-score (System/currentTimeMillis) ?timestamp ?upvotes ?downvotes :> ?score)
           (:sort ?score) (:reverse true)))
This would give (id, vote total, title) tuples that you can send to your templating engine (assuming url is based entirely on id) to make the front page.

The neat thing about the approach is that you can change everything later. If, for example, you wanted to do weighted upvotes/downvotes, you could produce a scoring function for the up/down votes like the one used here for scoring posts and aggregate the votes using sum instead of count.


That's cool thanks!

But for a counter in the millions, counting vote actions one by one seems inefficient. I think we still need some sort of `update` mechanism in a pure CRA db?


It's not particularly efficient, but that's not a design goal of the system. The tradeoff is robustness (you can't lose or corrupt data as long as the master data set is safe) and flexibility (you can generate whatever views you like on the data whenever you choose). The design comes from Twitter's analytics system, which was running this sort of thing over a 27TB raw dataset using Hadoop, so apparently it scales if you throw more hardware at it.

There's a second layer using Storm (not written up in the book yet, so I don't know the details) that handles all data newer than the most recent batch run and you somehow merge that new data with the old data (also not written up yet). I don't need to have this sort of system implemented immediately so I'm content to sit around and wait for new book chapters rather than try to muddle through.


If you don't count vote actions one by one, how do you prevent multiple votes by the same user?



