What The Heck Are You Actually Using NoSQL For? (highscalability.com)
121 points by abhijitr on Dec 6, 2010 | 64 comments



Most popular use right now is installing it on [usually only] two nodes, then writing a blog post about how it's going to change the world, then going back to using good ol' MySQL etc ;)


We're (Inkling) successfully using Cassandra in production to store most of our data.

I wrote the first version of our data store modeled off of FriendFeed's architecture (back when we were five people). There is an excellent blog post about it here:

http://bret.appspot.com/entry/how-friendfeed-uses-mysql

When Cassandra came along, we had long debates internally about the risk of hitching ourselves to their wagon. Eventually we decided that an open source project with a similar (enough) architecture to what we'd been building ourselves was preferable to maintaining a python library in-house.

So far we have no regrets. We have had no production issues and we have an architecture that is built to handle a large amount of incoming data (every note and highlight you take in an Inkling book is synched to our servers). We had already stripped away most SQL functionality by using FF's architecture, so we didn't lose much. And we gained a lot - the ability to tune the durability of different writes depending on the use case, far better fault tolerance, a conceptually simpler indexing strategy, etc.

It definitely has a learning curve - error and status reporting is not wonderful, you generally need to be over-provisioned as a strategy, etc. But I'm happy we learned those lessons sooner rather than later - I'd hate to scramble to implement Cassandra in a strained production environment (my sympathies to a certain website who suffered through this).


We use TokyoTyrant to store lots (TBs) of tiny (<100 bytes), constantly updating bits of data. The per-record overhead on MySQL and PostgreSQL was huge, and the continual updates were leaving lots of dead tuples that had to be cleaned up, which killed performance.

I use MongoDB for greenfield app development and experiments, so I can quickly alter "tables" and "attributes" without having to think about a schema up front or write migrations.
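To give a flavor of why that's so convenient, a minimal pymongo sketch (the collection and field names are invented):

    # pymongo sketch: no schema or migration needed to add an attribute.
    from pymongo import MongoClient

    db = MongoClient()['prototype']          # database name is arbitrary

    db.widgets.insert_one({'name': 'foo', 'price': 10})
    # Later, just start writing the new attribute; old documents are untouched.
    db.widgets.insert_one({'name': 'bar', 'price': 12, 'color': 'blue'})

    for w in db.widgets.find({'price': {'$gte': 10}}):
        print(w.get('color', 'n/a'))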


What kind of data was that, by the way? (in real-life terms).


Had the same experience with Tokyo Cabinet/Tyrant on a project. We had hundreds of millions of small records to store, and TCH could pack them into a space small enough to fit into the disk cache on an XL EC2 instance.


I use it mostly to store player-created levels and leaderboards, which lets developers efficiently and effortlessly include custom fields and data and filter by them.

I also use it for some smaller tasks, like storing filenames/hashes for distributing content across my 5 servers (this doesn't really need to be Mongo), and some configuration information for the games using my service.

I do it via MongoHQ. I did have it running and replicated across two of my servers, but I'm just not familiar enough to keep it running, and that bit me in the ass early on. With MongoHQ all I have to worry about is making sure my indexes are right.

Next up I'll be using it for some non-aggregated analytics stuff, which won't be as big as Facebook's requirements but will still be in the billions-a-month range.


Glad to hear that switching to Mongo worked out for you. I remember talking to you a while ago when most stuff was running on SQL Server.

For our project we went the Windows Azure route and store our 'events' in their NoSQL Table Storage. I must admit I have been tempted to switch to MongoHQ due to the lack of query support in Table Storage. Funnily enough, we're also working on a custom leaderboards feature for our customers, and the lack of secondary indexes really makes it difficult.


MongoHQ has a pretty awesome product. The only concern I have with them is that I noticed they recently capped databases at 20 gigs, but that's 3x more than my leaderboards come to at the moment, so that's future-me's problem.

The custom data on the leaderboards and levels is just ridiculously easy: it's basically copying object properties/values on the user's end straight into document properties/values in MongoDB.
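Roughly like this, as a pymongo sketch (field names invented; the real payload is whatever the game sends):

    # Whatever custom fields the game posts go straight into the document.
    from pymongo import MongoClient

    scores = MongoClient()['leaderboards']['scores']

    payload = {'player': 'dave', 'score': 9001,
               'track': 'moon_base', 'car': 'hatchback'}   # arbitrary custom fields
    scores.insert_one(payload)

    # Filtering by a custom field is then just a query on that field.
    top = scores.find({'track': 'moon_base'}).sort('score', -1).limit(10)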


Future-you should fear not. The 20GB limit is a soft limit, and we have many customers who have gone past it. It is really there for two reasons:

1) To give people some reasonable expectations about how much data they should be putting on the plan given it is a multi-tenant server and they are sharing resources.

2) To make room for an expanded offering, both a larger high-availability shared plan, and a dedicated server plan.


I use key-value stores so I can implement the right data model for the right problem.

The article cites graph databases as a better fit for graph applications than relational SQL databases. I'm using Redis data structures like sets and sorted sets that are 20x to 100x faster than relational databases.
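A hedged sketch of the kind of thing I mean, using redis-py (key names invented, redis-py 3.x assumed):

    # redis-py sketch (assumes redis-py 3.x); key names are invented.
    import redis

    r = redis.Redis()

    # Follower relationships as sets: "followed by both" is a set intersection.
    r.sadd('followers:alice', 'bob', 'carol')
    r.sadd('followers:bob', 'alice', 'carol')
    mutual = r.sinter('followers:alice', 'followers:bob')

    # A sorted set keeps a ranking ordered on every write.
    r.zadd('scores', {'alice': 120, 'bob': 95})
    top10 = r.zrevrange('scores', 0, 9, withscores=True)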

But I still use SQL for most CRUD operations where I care about consistency and volume isn't high enough to require sharding. There's a place for 'NoSQL', but it's not a panacea.


I'm using MongoDB as the default data store for a new geo LBS startup. Scalability and performance features were the main attraction, but also I just like the interfaces and ergonomics it presents to me as a programmer. Much prefer representing stored data structures in JSON, doing queries in RESTful HTTP-ish friendly form, and doing processing in Python, JavaScript, etc. Oh and not having to care about data schemas. At least early on in the project lifecycle, when there's no revenue and you may need to frequently evolve and/or significantly pivot the codebase. I love the fact that it tries to keep everything in memory if it can, and that it was designed from the ground up to be distributed.

MongoDB is also rapidly becoming my go-to data store when making prototypes to explore some new software project space, because it's so amenable to rapid development and change. It's also replacing memcached in some situations where all I needed was a dumb memory cache. It can act like a dumb memory cache, except it has these extra features waiting in the wings when I want them, which is a bonus.

I liked Redis in my evaluations and may use it more in the future but it lost out to MongoDB for the LBS project because it didn't fit the requirements as well or get enough little "taste wins" with me.


I use Redis as a cache and a queueing system at the same time.

Workers connect and get new jobs, and they also write their results to Redis. Everything touches Redis, and Redis is just so damn fast that I don't have to care about scaling for a while :)
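The pattern, as a minimal redis-py sketch (key names and the job format are invented):

    # Minimal cache + queue pattern with redis-py; names are invented.
    import json
    import redis

    r = redis.Redis()

    def do_work(job):
        return {'status': 'done'}            # placeholder for the real work

    def enqueue(job):
        r.rpush('jobs', json.dumps(job))

    def worker_loop():
        while True:
            _key, raw = r.blpop('jobs')      # blocks until a job arrives
            job = json.loads(raw)
            result = do_work(job)
            r.set('result:%s' % job['id'], json.dumps(result))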


We do this as well. Having sets, sorted sets and hash tables built in makes grody hacks much less common, and Redis is seriously fast. We persist the important info back to a more conventional store (Postgres), but we can keep a large proportion of our current working set in memory, in Redis. That rocks.


This is the very thing that attracted us to Redis. We haven't come close to outgrowing a single MySQL server yet, but there were a handful of areas where we felt MySQL was lacking. So far we have been able to make major performance improvements and decrease our database size dramatically without worrying about Redis persistence at all. That is, we use it purely in a caching capacity, where any data it contains can be reconstructed at any time.

The low-hanging fruit have been counter caches (lock contention with MySQL counters was a problem wayyy too early), transient data (sessions, IP bans, etc.), and picking random items from sets (other RDBMSes may have better ORDER BY random() properties, but MySQL sucks at it).
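For anyone curious, the three wins sketched with redis-py (key names invented; assumes a recent redis-py):

    # The three wins, sketched with redis-py (key names invented).
    import redis

    r = redis.Redis()

    # 1. Counter caches: INCR is atomic, with no row-lock contention.
    r.incr('comments:count:post:42')

    # 2. Transient data: sessions and bans simply expire on their own.
    r.setex('session:abc123', 3600, 'user:7')

    # 3. Random members of a set, without ORDER BY RAND() scanning a table.
    r.sadd('featured_items', 'a', 'b', 'c')
    pick = r.srandmember('featured_items')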

In general I feel like 90% of our data is best served by a relational database. It's possible to shoehorn the rest in, but redis primitives allow highly targeted improvements to both performance and elegance.


Yeah, not having to implement (say) priority work queues YET AGAIN in a RDBMS was like checking your regulator at 40m and finding that you have enough air to get back to the surface, after all.

We do a bunch of calculation with cached data in Redis, and it enables the naïve pattern of grabbing chunks of shit from the data store without having to cope with the horrible SQL generated by the ORM.
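A priority work queue on a sorted set really is only a few lines; here's a hedged redis-py sketch (it uses ZPOPMIN, which needs Redis 5.0+; on older Redis you'd do ZRANGE plus ZREM in a MULTI instead):

    # A priority queue on a sorted set (lower score = higher priority).
    import redis

    r = redis.Redis()

    def push(job_id, priority):
        r.zadd('workq', {job_id: priority})

    def pop():
        popped = r.zpopmin('workq')          # atomically removes the best job
        return popped[0][0] if popped else None   # member comes back as bytes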


I've started using NoSQL alongside an ACID-based system by storing document revisions (e.g., modified invoices and purchase orders) in a NoSQL database, with the SQL database holding the master data. We store document revisions so that if we ever need to audit modifications to an invoice, we can pull up every version of the document, from the first revision to the nth.
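If it helps to picture it, here's a sketch of the revision side, with MongoDB standing in for the NoSQL store purely for illustration (field names invented):

    # Master row stays in SQL; each prior version is appended here as an
    # immutable document. MongoDB is chosen purely for illustration.
    from datetime import datetime
    from pymongo import MongoClient

    revisions = MongoClient()['audit']['invoice_revisions']

    def archive_revision(invoice_row, revision_no):
        revisions.insert_one({
            'invoice_id': invoice_row['id'],
            'revision': revision_no,
            'snapshot': invoice_row,         # full copy of the row as it was
            'archived_at': datetime.utcnow(),
        })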


Sounds like you are doing what we are doing. Testing the waters, evaluating how NoSQL technologies fit into your architecture, and testing your problems against NoSQL solutions.

It is refreshing to see the approaches.


Thanks for the compliments!

I'm still a bit wary of using NoSQL, so we also store document revisions in SQL just in case, using a draft_id column (e.g., if I modified invoice 1021, the latest version would be updated in SQL and the previous version would be pushed into NoSQL).

We need to see if our software is still deemed auditable by accountants since they have very strict guidelines on how a document trail should be stored.


I am investigating replacing MySQL with MongoDB in my model layer for my next prototype.

My mind is still thinking in 3NF, though. I understand denormalization and avoiding joins will be useful from a performance standpoint. However, I am unsure when to include a foreign key and perform a second query from the application layer, and when to duplicate/embed all the field data. I'm leaning towards just making two simple sequential key-lookup queries, the second on the retrieved foreign key, rather than duplicating fields everywhere and keeping track of massively cascading changes - instead of performing one MySQL join - although I usually think in terms of minimizing round trips to the database server. Something like the sketch below.
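The two options I'm weighing, as a purely illustrative pymongo sketch (all names invented):

    # Option 1: store a foreign key and do a second key lookup at read time.
    from pymongo import MongoClient

    db = MongoClient()['app']
    db.users.insert_one({'_id': 7, 'name': 'Alice'})
    db.posts.insert_one({'_id': 1, 'title': 'Hello', 'author_id': 7})

    post = db.posts.find_one({'_id': 1})
    author = db.users.find_one({'_id': post['author_id']})   # second round trip

    # Option 2: embed the author fields the view needs, and accept that a
    # rename means updating every post that duplicates the name.
    db.posts.insert_one({'_id': 2, 'title': 'Again',
                         'author': {'_id': 7, 'name': 'Alice'}})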

Wondering if anyone has a heuristic for this or suggested reading?


Please ask yourself why. In reality, MongoDB has almost all of the same limitations as MySQL and PostgreSQL, but lacks the proven production record. In addition, MongoDB has very weak durability guarantees on a single server, poor performance for data that is not in memory, and carries over some common SQL/ACID scalability pitfalls (use of arbitrary indexes and ad-hoc queries).

Outside of this, you need to switch the question you ask from "what data do I need to capture?" to "what questions do I need to answer?"


>Please ask yourself why.

Looking for a way to scale hands-free, very cheaply, and without vendor lock-in. It would be nice if I could simply add another machine to the cluster, and not have to generate the IDs in the application layer, use a hashing algorithm to select the correct machine, and have everything stop working when a single database goes down. Seems like it should be a solved problem by now. Investigating new tech won't set me back much time, and my MySQL queries aren't going to disappear if I don't like it. Furthermore, looking at large sites such as Flickr that massively scaled MySQL, it seems like they stopped using its relational features anyway.


It's not as "hands-free" as you'd like to believe. Check out the MongoDB sharding introduction[1]. There are some pretty big caveats. Very few people are using auto-sharding at scale in production (bit.ly and BoxedIce are all I know of).

There are other operational issues with MongoDB. MongoDB can only do a repair if there is twice as much free disk space as the database uses, and the server must effectively be brought offline to do it. To reclaim unused disk space, you have to do a, you guessed it, compact/repair. Want to do a backup? The accepted way is to have a dedicated slave that can be write-locked for however long the backup takes. They suggest using LVM snapshots to keep that window short, but disk performance on volumes with LVM snapshots is terrible.

I would consider using MongoDB for a setup that is either non-critical, completely in memory with bounded growth (which itself sort of begs the question...), or mostly write-once data, such as feeds, analytics, and comment systems.

[1] http://www.mongodb.org/display/DOCS/Sharding+Introduction


Well, my platform is some number of $20 Linodes to start. I'm clustering the Python application across them using uwsgi+nginx (all I have to do to scale is add an IP address in the config), so it's a given that I'll shard the database across them as well. If you feel I should avoid Mongo, would you recommend Cassandra instead?

Regardless, I think my initial question regarding when to denormalize data applies to any database including scaled MySQL, but perhaps was a better question for stackoverflow.


Cassandra has its own hurdles, but if we're talking about getting your mind in the right place, I think it might be a better answer. Cassandra has a much more mature scalability implementation that isn't caveat-ridden like MongoDB's. It's operating at scale at both Twitter and Facebook.

Cassandra has online compaction, but still requires up to 2x space for compaction. However, Cassandra does not have to do a full scan of the entire database to do compaction, and almost never actually uses the 2x space. It's also much easier to maintain a Cassandra cluster, because each instance shares equal responsibility, and replication topology is handled for you.

Despite what their fans will say, these are both beta-quality products.


Care to back this up? I think MongoDB is awesome from what I've seen so far, and I've heard nothing but good things about it, even in the performance department. I'd love to see more info on what you're saying. Thanks.


MongoDB is designed specifically for multi-server durability. All writes are done in-place. It will support single-server durability in 1.8. Until then, 10gen strongly recommends against using a single-server setup with any data you care about[1].

I don't have any figures ready in front of me for disk performance, but MongoDB uses an in-memory layout and memory-mapped files, which give a far from optimal data layout for disk access. As you might imagine, it works well with SSDs, but performance is awful on rotational disks. Foursquare's outage was caused in part by their MongoDB instance beginning to exceed available RAM. My own experiences with larger-than-RAM MongoDB collections mirror this conclusion. Under these circumstances, you're likely to see much worse performance with MongoDB than with MySQL or PostgreSQL, especially with concurrent writes[2].

As far as the SQL pitfalls, and databases in general, go: don't think for a second that MongoDB has some magic that exempts it from the performance problems that RDBMSs have. Start using any of the familiar SQL scaling "no-no" features - multiple indexes, range queries, limits/offsets, etc. - and it's going to start exhibiting performance characteristics like a PostgreSQL or MySQL database would under those circumstances.

The temptation with MongoDB is to be lured into the idea that a brand new, immature database with convenient, SQL-like features can perform significantly better than its highly tuned RDBMS brethren. There are some reasons to choose MongoDB over MySQL and PostgreSQL, but performance should not be one of them.

[1] http://www.mongodb.org/display/DOCS/Durability+and+Repair#Du...

[2] http://jira.mongodb.org/browse/SERVER-574


All the same limitations? Most SQL databases don't support anywhere near the number of atomic operations that MongoDB provides. Also, sharding MongoDB is a lot easier than sharding MySQL or PostgreSQL.


I'm a little confused by this; how do SQL databases not support more atomic operations than MongoDB, seeing as you have full control of the transactions themselves? Incrementing, adding to a list (generally done by inserting a row in another table), etc., are all standard operations.


Try upserting a row (update if it exists, insert if it doesn't); that doesn't work race-condition-free even in SERIALIZABLE isolation.
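For comparison, in MongoDB the server applies the upsert as a single atomic operation; a pymongo sketch (names invented):

    # pymongo sketch: the server applies the upsert as one atomic operation.
    from pymongo import MongoClient

    counters = MongoClient()['app']['counters']
    counters.update_one({'_id': 'pageviews'},
                        {'$inc': {'n': 1}},
                        upsert=True)         # insert-or-update, no race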


We used MongoDB for a weekend hack/prototype cycling news site. It was an excuse to play with Mongo, not because we felt we'd need Mongo for this particular site. I am very glad we did use it, because I now understand some of the cool things you can do with Mongo.

Basically, if you are prototyping and have a few extra hours to spend playing, I would say go for it. Can't hurt to understand the tool, so you can pick it if it makes sense for what you need to do down the road.


The confusion about the use case of NoSQL probably stems from the term "NoSQL" being so vaguely defined. All you can say for sure about one is that it's not relational. But other than that there's not much in common between (for example) key-value stores, graph databases and object databases.

Result: The author has to qualify all statements with "only applies to some NoSQL databases".


I would love a relational database with some less horrible language than SQL to manage it. That'd be NoSQL, no lie, but considerably different than a KV store.


I think "NoSQL" is a bad name too. With NoSQL you would expect the opposite of a DB managed by SQL, when in fact it's something completely different.


I used MongoDB for my MSc thesis, and I really enjoyed the silent data corruption feature [1]. That said, I would use MongoDB again if it stops being horribly unstable.

These days, I use redis for caching and as a celery backend, and I love it to bits.

[1] not.


Please dish, especially which version of MongoDB you were using when you had the corruption issue.


Here's part 2, there's a link to part 1:

http://www.korokithakis.net/node/119

Also, because everyone is going to skim the post and say "you were using the 32-bit version":

1. Only for some of the corruptions.

2. IT SHOULDN'T CORRUPT DATA ANYWAY!


SQL gives some people the howling fantods. I think a large part of the programming-nerd population looks at it and sees a kludgy chimera like JCL, or an inscrutably unfunny INTERCAL variant. Maybe they've been forced to clean up the leaked abstractions of a shitty ORM on a project of theirs; maybe someone slipped them the hot SQL injection, back in the days of CGI; maybe their parents were tragically trampled to death by an elephant while a nearby dolphin laughed. Regardless of motivation, you have to concede to those it riles that SQL is not the most likeable language/model/framework/paradigm out there.

Personally, I actually love SQL. If you ask me, nothing satisfies like nested right inner joins. But that's not all I am. I have needs. Sometimes, what I need is a schemaless, eventually consistent, document-oriented persistent data store, because I am aggregating data from multiple web service APIs whose field names and structures change around like they were samples on a Girl Talk record. CouchDB ain't SlouchDB... I can dance to that.

I can tell that some of my buddies are embracing databases from the ranks of the quote-unquote NoSQL movement because these databases' aesthetics are not at all like SQL. That's the catch with NoSQL -- it's a really pointless thing to talk about because it's not a thing; it is an un-thing, a classification of everything in the contemporary database world that is not SQL. It's the kind of classification that makes the most sense in the emotional context of how people feel about SQL.


Writing and processing a ton of analytics data - all the raw logs sent into Flurry are stored in HDFS and then processed with map-reduce jobs.
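For a flavor of the processing side, here's a hedged sketch of a Hadoop Streaming job in Python that counts events per day (the log format is invented; this is not Flurry's actual pipeline):

    # Hadoop Streaming sketch: count events per day from tab-separated logs
    # (the log format is invented).
    import sys

    def mapper():
        for line in sys.stdin:
            day = line.split('\t', 1)[0]
            print('%s\t1' % day)

    def reducer():
        current, total = None, 0
        for line in sys.stdin:
            day, n = line.rstrip('\n').split('\t')
            if day != current:
                if current is not None:
                    print('%s\t%d' % (current, total))
                current, total = day, 0
            total += int(n)
        if current is not None:
            print('%s\t%d' % (current, total))

    if __name__ == '__main__':
        (mapper if sys.argv[1] == 'map' else reducer)()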


At Posterous we use MySQL as a main data store but MongoDB for simple analytics and Redis for set operations around contacts and subscriptions.

Redis is also great as a backing store for the Rails queuing system Resque.


I'm using MongoDB to store real estate MLS data. MLS vendors all have their own schemas for storing property data, and they can change on a regular basis. Rather than attempt to map every field, we store each property as a document. We index the important fields (price, beds, baths, etc.), but all other "metadata" is stored as a Dirty Attribute (in an embedded document).
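Roughly how one listing looks, as a pymongo sketch (field names simplified and invented):

    # Roughly how one listing is stored (field names simplified).
    from pymongo import MongoClient

    listings = MongoClient()['mls']['properties']
    listings.create_index([('price', 1), ('beds', 1), ('baths', 1)])

    listings.insert_one({
        'vendor': 'SOMEMLS',
        'price': 250000, 'beds': 3, 'baths': 2,   # indexed, queryable fields
        'dirty_attributes': {                     # vendor-specific leftovers
            'LotSizeSource': 'Assessor',
            'GarageSpaces': '2',
        },
    })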

We are still beta testing and have 100K properties and ~1 million images stored. Query time is faster than our current LAMP site. We will shard based on MLS Vendor as we add more.


This blog entry ( http://www.engineyard.com/blog/2009/ldap-directories-the-for... ) is a little old now, but makes an interesting point - that LDAP was the original "NoSQL." If you buy that idea, lots of people have been using NoSQL - and using it for all sorts of things - for quite some time now.


But LDAP is an Abomination unto the eyes of the Lord.


PakistanSurvey.org is using MongoDB for general data storage of tons of survey responses and aggregation of them.

Similar sites we're working on use CouchDB in the same capacity, because we can push simple analysis back into the database and represent more complex data in a natural form.


I use a combination of semi-reliable stores, including a NoSQL table storage provider, to implement durable messaging for distributed systems with nodes located in any number of data centers or clouds.


Vitrue is currently using an auto-sharded MongoDB cluster in production for writing and reading analytics data for our apps. We're still rolling it out across the board, but so far it's going great.


AFAIK Stylous.com uses vertexdb, a graph database.


We're using Redis because it's zero-configuration-required. This makes deploying on end-user networks much much easier.


A non-ACID DB is most useful in free web apps, where users' data is practically worthless, so who cares about the occasional lost record?

When people are paying for a service, though, an ACID DB is a must.


Electronic banking systems are far from ACID (think ATM operations, check and credit card transfer processes). Stock exchanges aren't ACID. Amazon.com isn't ACID. Logistics systems (FedEx, UPS, etc) aren't ACID. In fact, if you look at the information systems of Fortune 100 corporations, you'll find that almost every single one of them is non-ACID at the core.


This is not really very accurate.

At their core, each of these systems use ACID databases (or very nearly if you nitpick about isolation levels in Oracle). Between databases and between companies, they've developed "eventually correct" schemes to synchronize information. The lease patterns that many of these systems rely on require atomic operations at their core and offer stronger guarantees than the basic eventually consistent systems popularized by models like Amazon's Dynamo.

It's not just that the account values have to agree between systems at the end of the day; they have to actually be correct at the end of the day.

I know that consistency models vary between NoSQL systems (and even within a single NoSQL system). There's some great technology out there and plenty of problems to solve. There are certainly plenty of use cases for NoSQL systems within banks and stock exchanges.

But the "banks are eventually consistent" line of reasoning needs to die.


EDIT: For full disclosure purposes, the parent post is from John Hugg, a software engineer at VoltDB, which is a high-scale data store that competes with many "NoSQL" databases. I am not claiming his point of view is invalid, just that it comes from a certain perspective, and should be viewed in this light.

Amazon's Dynamo itself is built on BerkeleyDB, which is ACID compliant. That doesn't mean Dynamo is an ACID system. You have to view the system as a whole, not just the component parts. The information systems I refer to in large banks, stock exchanges, and logistics are often composed of thousands of instances of ACID-compliant databases, but as a whole operate with eventual consistency guarantees. EC is kind of a misnomer for Dynamo anyways, because it's really TUNABLE consistency. Dynamo can operate in a fully consistent mode, but you're going to sacrifice availability. CAP theorem doesn't care if you're a bank or a stock exchange or you have a trillion dollars. It still applies.


I was referring to the Dynamo model for EC, not a particular software implementation. That model has next-to-zero traction in systems that handle non-trivial sums of money, and for good reason.

That's not to say it's not great for lots of things. Amazon uses this model for all kinds of stuff, but as soon as you go to check out your order, you get kicked back into ACID-ville.


Dynamo's consistency ranges from fully consistent to loosely consistent and is tunable on a per-application and per-call basis. This means a write can be reconciled against 100% of the replicas before it is considered finished. How is this any different than the consistency guarantees an ACID compliant system provides?

You make a claim that Amazon uses ACID semantics for the checkout process. The Dynamo paper claims that in order to meet business goals, the complete Amazon.com order process must be highly-available and partition tolerant. A system that is ACID compliant must sacrifice availability or partition tolerance, but this isn't the case for Amazon's purchase process. Amazon simply strengthens consistency and durability guarantees in the case of checkouts with a quorum write. It would be exceedingly rare for a partition or disaster to knock out communication with more than one datacenter, so this works very well in practice.


Quorum writes to Dynamo nodes allow you to make atomic and durable updates to single keys and the data associated with them. ACID transactions allow you to mutate data associated with multiple keys atomically. This is not the same thing at all. Common operations like debiting one account and crediting another, or summing a set of values, become difficult.
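To make the distinction concrete, here's a small sketch using Python's sqlite3 module, purely illustrative and not anyone's production system: either both rows change or neither does.

    # Either both rows change or neither does.
    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)')
    conn.executemany('INSERT INTO accounts VALUES (?, ?)',
                     [('alice', 100), ('bob', 0)])

    with conn:                               # one transaction: commit or roll back
        conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 'alice'")
        conn.execute("UPDATE accounts SET balance = balance + 50 WHERE id = 'bob'")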

Now atomic updates to single keys can actually be used as the building blocks for transactional functionality (using the lease pattern and/or compensating operations). If you need a transaction here or there, this might be a very workable solution. If you need lots of transactions, then you end up using Dynamo systems in unnatural ways; many of their performance and availability advantages are wasted in this configuration.

So yes, you could build a bank given quorum writes, but the point is that it would probably be a poor engineering choice.


I wasn't saying banks should use Dynamo, but that Dynamo-style consistency is appropriate for conducting e-commerce transactions, and originally that most scaled industries don't depend on ACID semantics in their core datastores.

Bank accounts are eventually consistent logs, which doesn't look like Dynamo, but also looks nothing like an RDBMS table. It's well known and understood that almost all banking systems are based around mainframe-era batch-processing systems that are eventually consistent in nature. It's awfully ironic that the "hello world" used to demonstrate transactions in the RDBMS world is a debit and credit of bank accounts, a situation that is vanishingly rare in the real world.

Also, please address my original rebuttal to your claim that Amazon uses an ACID database for their checkout process.


Again, replying to a deeper comment.

I sure don't disagree with everything you are saying.

As for adding up transactions to get a balance, yes, this is how the basics work, and it doesn't look like ACID at all. However, what you're describing isn't eventual consistency (per Dynamo), it's eventual completeness. You can't lose a log entry. All of these messages are two-phase committed between systems.

So as I said before, you could build this kind of system using quorum writes on a Dynamo system, but it wouldn't be a solid engineering choice.

Also, yes, BerkeleyDB has multi-key transactions. Dynamo on top of BDB does not expose or use this functionality (as described in the paper). Publicly available systems that implement Dynamo, like Cassandra, Riak, or SimpleDB, do not have native transactions either.

Fun chatting. I'm off to bed.


Replying to the deeper nested comment.

1) I totally concede that you could use Dynamo-style consistency to be the system of record for financial exchanges. I just don't think it's a good idea. I'm not sure anyone building these systems does either.

Though if you're doing e-commerce with a payment processing gateway like PayPal, you might be in a different situation. You're no longer anything like a bank at that point (PayPal is).

2) I don't know what Amazon does now, but the last time I talked with them (2009) they were huge users of legacy RDBMSs for actual order processing. I know they're not thrilled with this arrangement, but I doubt they think Dynamo is the answer. Perhaps cores of ACID-ness with Dynamo-esque replication between them...

3) You're right that banks use logs for a tremendous amount of things. And a VISA charge looks nothing like a simple debit and credit transaction. However, if you think that banks don't have debit and credit transactions in their applications, you haven't worked on their applications.


I am starting to believe that we both essentially believe mostly the same points, but we're using different nomenclature.

You stated previously that most of these real-world, high-scalability, high-availability systems use ACID databases as a backing store, something I echoed in the context of Dynamo's use of BerkeleyDB (and MySQL/InnoDB) as an underlying data store. I agree with this.

You state that a Dynamo-style key-value store wouldn't be that great for a financial institution's accounting system. I agree with this. Dynamo is terrible for "log" data.

This doesn't change the idea that these systems as a whole aren't ACID, and do not require ACID semantics. Financial systems are often based on networks (in the conceptual sense) with varying degrees of trust, and are batch-reconciled logs of transactions. I'm talking about stock and commodity markets, electronic transfer systems, and traditional banking. They may be built from ACID building blocks at the low level, but that does not change the fact that the systems managing my checking account, credit cards, and stock market transactions are not functionally ACID, or that they even need to be. Dynamo itself is built on ACID databases, but it is not ACID itself.

The balance of an account can be calculated by adding all credits and debits, which is what makes a financial account radically different than say, a Facebook profile. Each transaction is essentially immutable. Even if the transaction must be reversed, it's reversed as an additional credit (or a unique reversal transaction), not a deletion of the original transaction. While transactions against the account are almost always immediately available, they aren't ALWAYS, and this is alright, because the bank's customers understand this. Banks still reconcile accounts on a nightly basis, using batch processes. This is when accounts are officially settled and consistency is applied, which is why it's eventual. However, financial accounts are somewhat unique because their date-sorted, log-structured nature makes them quite suited to eventual consistency.


As for the VoltDB comment, how much we compete with NoSQL or whether we are NoSQL is an interesting line of questions. We're trying to bleed customers away from entrenched systems the same as most NoSQL systems. I think there's plenty to go around.


I take your points. Here's the thing: if ACID is an aspect of developer tools and not of the user experience, and if ACID tools are the default, and if (as anyone who has held a checking account in the U.S. would surely agree) user-experience failures analogous to ACID violations occur regularly, isn't that evidence that we should keep questioning what value we actually get from basing our systems on ACID/relational data stores?


Asking what value a particular app gets from an ACID store vs a non-ACID store as part of a comprehensive analysis of two competing technologies is a GOOD THING.

As for banks, exchanges, credit cards, packages, etc:

When non-trivial sums of money are changing hands, ACID stores offer tremendous benefits as a building block in a larger system.

Yes, things go wrong. Even given ACID building blocks, building huge banking systems is hard. Forcing bank developers to worry about consistency not just between systems, but also within systems isn't going to make that job easier.


But it's not always the data by itself that has value. Sure, it might be useless on its own, but if your app loses it and it causes the experience of using the app to be bad, then it doesn't matter how useless you think the data is; it's going to hurt the user's perception of your product.

Also, "free web apps" doesn't translate to "worthless user data", although that's a different argument altogether.


Twitter is pretty lucky they were able to survive all of the negative press when they kept going down a couple years back.


With a lot of the big key-value stores like Cassandra you can make your data very safe. You just have to trade off other stuff.



