This is INCREDIBLE news! FoundationDB is the greatest piece of software I’ve ever worked on or used, and an amazing primitive for anybody who’s building distributed systems.
The short version is that FDB is a massively scalable and fast transactional distributed database with some of the best testing and fault-tolerance on earth[1]. It’s in widespread production use at Apple and several other major companies.
But the really interesting part is that it provides an extremely efficient and low-level interface for any other system that needs to scalably store consistent state. At FoundationDB (the company) our initial push was to use this to write multiple different database frontends with different data models and query languages (a SQL database, a document database, etc.) which all stored their data in the same underlying system. A customer could then pick whichever one they wanted, or even pick a bunch of them and only have to worry about operating one distributed stateful thing.
But if anything, that’s too modest a vision! It’s trivial to implement the Zookeeper API on top of FoundationDB, so there’s another thing you don’t have to run. How about metadata storage for a distributed filesystem? Perfect use case. How about distributed task queues? Bring it on. How about replacing your Lucene/ElasticSearch index with something that actually scales and works? Great idea!
And this is why this move is actually genius for Apple too. There are a hundred such layers that could be written, SHOULD be written. But Apple is a focused company, and there’s no reason they should write them all themselves. Each one that the community produces, however, will help Apple to further leverage their investment in FoundationDB. It’s really smart.
I could talk about this system for ages, and am happy to answer questions in this thread. But for now, HUGE congratulations to the FDB team at Apple and HUGE thanks to the executives and other stakeholders who made this happen.
Now I’m going to go think about what layers I want to build…
[1] Yes, yes, we ran Jepsen on it ourselves and found no problems. In fact, our everyday testing was way more brutal than Jepsen, I gave a talk about it here: https://www.youtube.com/watch?v=4fFDFbi3toc
Will said what I wanted to say, but: me too. I'm super happy about this and grateful to the team that made it happen!
(I was one of the co-founders of FoundationDB-the-company and was the architect of the product for a long time. Now that it's open source, I can rejoin the community!)
Another (non-technical) founder here, and I echo everything voidmain just said. We built a product that is unmatched in so many important ways, and it's fantastic that it's available to the world again. It will be exciting to watch a community grow around it; this is a product that can benefit hugely from open-source contributions in the form of layers that sit on top of the core KV store.
I echo what wwilson has said. I work at Snowflake Computing, a SQL analytics database in the cloud, and we have been using FoundationDB as our metadata store for over 4 years. It is a truly awesome product and has proven rock-solid over this time. It is a core piece of our architecture and is heavily used by all our services. Some of the layers that wwilson is talking about, we've built: a metadata store, an object-mapping layer, a lock manager, and a notifications system. In conjunction with these layers, FoundationDB has allowed us to build features that are unique to Snowflake. Check out our blog post titled "How FoundationDB powers Snowflake Metadata forward" [1].
Kudos to the FoundationDB team and Apple for open sourcing this wonderful product. We're cheering you all along! And we look forward to contributing to the open source and community.
I am one of the designers of probably the best known metadata storage engine for a distributed filesystem, hopsfs - www.hops.io.
When I looked at FoundationDB before Apple bought you, you supported transactions - great. But we need much more to scale. Can you tell me which of the following you have:
row-level locks
partition-pruned index scans
non-serialized cross-partition transactions (that is, a transaction coordinator per DB node)
distribution-aware transactions (hints on which TC to start a transaction on)
It's somewhat hard to answer your questions because the architecture (and hence, terminology) of FoundationDB is a little different than I think you are used to. But I will give it a shot.
FoundationDB uses optimistic concurrency, so "conflict ranges" rather than "locks". Each range is a (lexicographic) interval of one or more keys read or written by a transaction. The minimum granularity is a single key.
FoundationDB doesn't have a feature for indexing per se at all. Instead indexes are represented directly in the key/value store and kept consistent with the data through transactions. The scalability of this approach is great, because index queries never have to be broadcast to all nodes, they just go to where the relevant part of the index is stored.
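The pattern is simple enough to sketch in a few lines of Python. Note this is a toy model, not the real FDB API: an ordinary dict stands in for the ordered key/value store, and the key layout is purely illustrative.

```python
# Toy ordered key/value store standing in for FoundationDB. Tuples sort
# lexicographically, much like FDB's tuple-layer-encoded keys.
store = {}

def set_user(user_id, name):
    """Update the record and its secondary index entry together, as one
    transaction would in FDB."""
    old = store.get(("user", user_id))
    if old is not None:
        del store[("idx_name", old, user_id)]   # drop the stale index entry
    store[("user", user_id)] = name             # primary record
    store[("idx_name", name, user_id)] = b""    # index entry: name -> id

def users_named(name):
    """Index lookup: equivalent to a range scan over ("idx_name", name, *),
    which only touches the nodes storing that part of the index."""
    return sorted(k[2] for k in store
                  if k[0] == "idx_name" and k[1] == name)

set_user(1, "alice")
set_user(2, "bob")
set_user(3, "alice")
set_user(1, "carol")   # rename: record and index entry change atomically
```

Because the record and its index entry are written in the same transaction, a scan of the index prefix is always consistent with the data, and no query ever needs to be broadcast to all nodes.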
FoundationDB delivers serializable isolation and external consistency for all transactions. There's nothing particularly special about transactions that are "cross-partition"; because of our approach to indexing and data model design generally we expect the vast majority of transactions to be in that category. So rather than make single-partition transactions fast and everything else very slow, we focused on making the general case as performant as possible.
Transaction coordination is pretty different in FoundationDB than in 2PC-based systems. The job of determining which conflict ranges intersect is done by a set of internal microservices called "resolvers", which partition up the keyspace totally independently of the way it is partitioned for data storage.
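A toy model of that scheme, with plain Python objects in place of the resolver microservices, single keys in place of conflict ranges, and a global counter in place of the real versioning machinery:

```python
# Each resolver owns a slice of the keyspace (independent of data
# placement) and remembers the commit version of recent writes in its
# slice. A transaction commits only if every resolver sees no write to
# the transaction's read set newer than its read version.
class Resolver:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.last_write = {}                  # key -> commit version

    def owns(self, key):
        return self.lo <= key < self.hi

    def check(self, read_version, read_keys):
        return all(self.last_write.get(k, 0) <= read_version
                   for k in read_keys if self.owns(k))

    def apply(self, version, write_keys):
        for k in write_keys:
            if self.owns(k):
                self.last_write[k] = version

resolvers = [Resolver("", "m"), Resolver("m", "\xff")]  # keyspace split at "m"
version = 0

def try_commit(read_version, read_keys, write_keys):
    global version
    if all(r.check(read_version, read_keys) for r in resolvers):
        version += 1
        for r in resolvers:
            r.apply(version, write_keys)
        return True
    return False    # a conflict anywhere aborts the whole transaction
```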
Please tell me if that leaves questions unresolved for you!
Thanks for the detailed answer.
Is it actually serializable isolation - does it handle write skew anomalies (https://en.wikipedia.org/wiki/Snapshot_isolation)?
Most OCC systems I know have only snapshot isolation.
Systems that sound closest to FoundationDB's transaction model that I can think of are Omid (https://omid.incubator.apache.org/) and Phoenix (https://phoenix.apache.org/transactions.html). They both support MVCC transactions, but I think they have a single coordinator that gives out timestamps for transactions, like your "resolvers". The question is how your "resolvers" reach agreement: are they each responsible for a range (partition)? If transactions cross ranges, how do they reach agreement?
We have talked to many DB designers about including their DBs in HopsFS, but mostly it falls down on something or other. In our case, metadata is stored fully denormalized - all inodes in a FS path are separate rows in a table. In your case, you would fall down on secondary indexes - which are a must. Clustered PK indexes are not enough. For HopsFS/HDFS, there are so many ways in which inodes/blocks/replicas are accessed using different protocols (not just reading/writing files or listing directories, but also listing all blocks for a datanode when handling a block report). Having said that, it's a great DB for other use cases, and it's great that it's open-source.
Yes, it's really serializable isolation. The real kind, not the "doesn't exhibit any of the named anomalies in the ANSI spec" kind. We can selectively relax isolation (to snapshot) on a per-read basis (by just not creating a conflict range for that read).
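The classic write-skew case (two on-call doctors each read that the other is still on call, then each signs off) shows why validating reads matters. Here is a toy OCC commit check in Python, an illustration of the principle rather than FoundationDB's actual code:

```python
# A transaction conflicts if any key it READ was written by a transaction
# that committed after its read version. Validating reads, not just
# write-write overlap, is what rules out write skew.
committed = []   # list of (commit_version, written_keys)

def commit(read_version, reads, writes):
    for v, written in committed:
        if v > read_version and written & reads:
            return False                        # conflict: abort
    committed.append((len(committed) + 1, writes))
    return True

# Both transactions read {"alice", "bob"} at version 0, see two doctors
# on call, and each tries to take one off call.
t1 = commit(0, {"alice", "bob"}, {"alice"})     # commits
t2 = commit(0, {"alice", "bob"}, {"bob"})       # aborts: t1's write to
                                                # "alice" is in t2's reads
```

Under plain snapshot isolation both would commit, leaving nobody on call.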
I tried to explain distributed resolution elsewhere in the thread.
I believe our approach to indices pretty much totally dominates per-partition indexing. You can easily maintain the secondary indexes you describe; I don't understand your objection.
My guess is the objection lies in "Have to manage the index myself."
Also, the main drawback of "indices as data" in NoSQL is when you need to add a new index: suddenly, you have to scrub all your data and add it to the new index using some manual walk-the-data function, and you have to make sure that all operations that take place while you're doing this are also aware of the new index and its possibly incomplete state.
Certainly not impossible to do, but it sometimes feels a little bit like "I wanted a house, but I got a pile of drywall and 2x4 framing studs."
"I wanted a house, but I got a pile of drywall and 2x4 framing studs."
This is a totally legitimate complaint about FoundationDB, which is designed specifically to be, well, a foundation rather than a house. If you try to live in just a foundation you are going to find it modestly inconvenient. (But try building a house on top of another house and you will really regret it!)
The solution is of course to use a higher level database or databases suitable to your needs which are built on FDB, and drop down to the key value level only for your stickiest problems.
Unfortunately Apple didn't release any such to the public so far. So I hope the community is up to the job of building a few :-)
I agree with voidmain’s comment as secondary indexes shouldn’t be any different than the primary KV in your case. Almost seems that you’re focusing on a SQL/Relational database architecture but storing your data demoralized anyways. Odd combination of thoughts.
Is that where the data sits around waiting, eager, to be queried into action while watching data around them being used over and over again.... But that time never comes, thus leaving them to question their very worth?
> Transaction coordination is pretty different in FoundationDB than in 2PC-based systems. The job of determining which conflict ranges intersect is done by a set of internal microservices called "resolvers", which partition up the keyspace totally independently of the way it is partitioned for data storage.
Ok, per my other question that makes sense. Similar to FaunaDB except the "resolvers" (transaction processors) are themselves partitioned within a "keyspace" (logical database) in FaunaDB for high availability and throughput. But FaunaDB transactions are also single-phase and we get huge performance benefits from it.
> best known metadata storage engine for a distributed filesystem, hopsfs
Not even close. I don't even see anything I'd call a filesystem mentioned on your web page. I missed FAST this year, and apparently you had a paper about using Hops as a building block for a non-POSIX filesystem - i.e. not a filesystem in my and many others' opinion - but it's not clear whether it has ever even been used in production anywhere let alone become "best known" in that or any other domain. I know you're proud, perhaps you have reason to be, but please.
I'm not convinced a paper with a website and some proofs of concept would be considered the "best". You're throwing a bunch of components into a distro and calling yourselves everything from "deep learning" to a file system. It's not clear what you guys are even trying to do here.
You don't need to worry about shards / partitions with FDB. Their transactions can involve any number of rows across any number of shards. It is by far the best foundation for your bespoke database.
Your talk was one of the best talks I've seen, and I keep mentioning it to people whenever they ask me about distributed systems, databases, and testing.
I'm incredibly impatient to have a look at what the community is going to build on top of this very promising technology.
I'm not familiar with FDB, but what you say sounds almost too good to be true. Can I use it to implement the Google Datastore API? I've been trying for years to find a suitable backend so that I can leave Google land. Everything I tried either required a schema or lacked transactions or key namespaces.
If I recall correctly, the "SQL layer" you had in FDB before the Apple acquisition was a nice proof of concept, but lacked support for many features (joins in SELECT, for example). Is the SQL layer code from that time available anywhere to the public? (I'm not seeing it in the repo announced by OP.)
I used to work there. The SQL layer was actually capable of the majority of SQL features, including joins, etc. We had an internal Rails app that we used to dog-food it for performance monitoring. I used to work on the document layer, and was sad to see it wasn't included here.
I hope this question doesn't feel too dumb, but is it possible to implement a SQL layer using SQLite's virtual table mechanism and leverage all of FoundationDB's features?
It doesn't look like Apple open-sourced the Document Layer, which is a slight bummer. But I echo what Dave said below: what we got is incredible, let's not get greedy!
Also TBH now that I don't have commercial reasons to push interop, if I write another document database on top of FDB, I doubt I'd make it Mongo compatible. That API is gnarly.
Other than that, they totally did a "fake it until you make it": MongoDB 3.4 passed Jepsen a year ago, and MongoDB BI 2.0 contains their own SQL engine instead of wrapping PostgreSQL.
What specifically are you trying to avoid endorsing about the author of the LinkedIn post to which you linked? I couldn't find anything from a cursory web search.
He runs lambdaconf, and refused to disinvite a speaker who many people felt shouldn't be permitted to speak because of his historical non-technical writings.
(I've tried to keep the above as dry as possible to avoid dragging the arguments around this situation into this thread - and I suspect the phrasing of the previous comment was also intended to try and avoid that, so let's see if we can keep it that way, please)
You could try adding controversy to the author name and searching then. As mst correctly notes, I am trying to avoid reigniting said controversy while indicating my distaste.
I've forked the official SDK so that I can get extra functionality, but it's quite hard to keep it updated when internal stuff changes. There is no way I can contribute.
I can't use it everywhere I want. In short, it's not open source, and that sucks.
MongoDB is AGPL or proprietary. Many companies have a policy against using AGPL licensed code. So, if you work at one of those companies, then open source MongoDB is not an option (at least for work projects), and proprietary may not be either (depending on budget etc).
FoundationDB is now licensed under Apache 2, which is a much more permissive license, so most companies' open source policies allow it.
Unless people want to change the MongoDB code that they would be using, using the AGPL software should be a non-issue, and there are no problems with it. People should start understanding the available licenses instead of spreading fear.
I know that multiple other companies have a similar policy (either a complete ban on using AGPL-licensed software, or special approval required to use it), although unlike Google, they don't post their internal policy publicly.
If someone works at one of these companies, what do you want to do – spend your day trying to argue to get the AGPL ban changed, or a special exception for your project; or do you just go with the non-AGPL alternative and get on with coding?
The main reason it's a problem at many of the companies which ban it is they have a lot of engineers who readily patch and combine code from disparate sources and might not always apply the right discipline to keep AGPL things appropriately separate. Bright-line rules can be quite useful.
It is true that MongoDB's AGPL contagion and compliance burden, if you don't modify it, is less than many fear. It is also true that those corporate concerns are valid. MongoDB does sell commercial licenses so that such companies can handle their unavoidable MongoDB needs, but they would tend to minimize that usage.
> How about replacing your Lucene/ElasticSearch index with something that actually scales and works?
Do you have something to back that up? This to me reads like you imply that Elasticsearch does not work and scale.
It's definitely interesting, but I'm cautious. The track record for FoundationDB and Apple has not been great here. IIRC they acquired the company and took the software offline, leaving people out in the rain?
Could this be like it happened with Cassandra at Facebook where they dropped the code and then more or less stopped contributing?
Also, I haven't seen many contributions from Apple to open-source projects like Hadoop etc. in the past few years. Looking for "@apple.com" email addresses in the mailing lists doesn't yield many results. I understand that this is a different team and that people might use different email addresses and so on.
In general I'm also always cautious (but open-minded) when there's lots of enthusiasm and there seems to be no downside. I'm sure FoundationDB has its dirty little secrets and it would be great to know what those are.
They are, slowly. Swift is open source, Clang is open source. They are moving parts of the Xcode IDE into open source, like sourcekitd and now recently clangd.
I don't think they will ever move 'secret sauce' into open source, but infrastructural things like DBs and dev tooling seems to be going in that direction.
~~How is it different from when Apple acquired the then-open-source FoundationDB (and shut down public access)? They could have just kept it open source back then.~~
EDIT: My bad, looks like FoundationDB wasn't fully open-source back then.
From what I recall (and based on some quick retro-googling), I don't believe FoundationDB was open source. One of the complaints about it on HN back in the day was that it was closed...
Unrelated to the original topic, but I had never come across that talk and it is great. I use the same basic approach to testing distributed systems (simulating all non-deterministic I/O operations) and that talk is a very good introduction to the principle.
Know of any ideas around using this for long-term time series data? I wonder about something like OpenTSDB, but with this as the backend instead of HBase (which can be a sort of operational hell).
It's more like you would build a better Elasticsearch using Lucene to do the indexing and FoundationDB to do the storage. FoundationDB will make it fault tolerant and scalable; the other pieces will be stateless.
It'd take a low number of hours to wire up FoundationDB as a Lucene filesystem (Directory) implementation. Shared filesystem with a local RAM cache has been practical for a while in Lucene, and was briefly supported then deprecated in Elasticsearch. I've used Lucene on top of HDFS and S3 quite nicely.
If you have a reason to use FoundationDB over HDFS, NFS, S3, etc, then this will work well.
Doing a Lucene+DB implementation where each term's posting lists are stored natively in the key-value system was explored for Lucene+Cassandra as Solandra (https://github.com/tjake/Solandra). It was horrifically slow, not because Cassandra was slow, but because posting lists are heavily optimized, and putting them in a generalized B-tree or LSM-tree variant removes some locality and many of the possible optimizations.
I'm still holding out some hope for a hybrid implementation where posting list ranges are stored in a kv store.
I think you are on the right track. Storing every individual (term, document, ...) in the key value store will not be efficient, but you should be able to take Lucene's nice fast immutable data structure and stuff blocks of it (at the term level or below) into FDB values very efficiently. And of course you can do caching (and represent invalidation data structures in FDB), and...
FDB leaves room for a lot of creativity in optimizing higher layers. Transactions mean that you can use data structures with global invariants.
So from the Lucene perspective, the idea of a filesystem is pretty baked into the format. However, there's also the idea of a Codec which takes the logical data structures and translates to/from the filesystem. If you made a Codec that ignored the filesystem and just interacted with FDB, then that could work.
You can already tune segment sizes (a segment is a self-contained index over a subset of documents). I'd assume that the right thing to do for a first attempt is to use a Codec to write each term's entire posting list for that one segment to a single FDB key (doing similar things for the many auxiliary data structures). If it gets too big, then you should have tuned max segment size to be smaller. Do some sort of caching on the hot spots.
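A rough sketch of that packing in Python, with a dict standing in for FDB and a naive delta encoding in place of Lucene's real posting format:

```python
import struct

kv = {}   # stands in for the FDB key/value store

def write_segment(segment_id, postings):
    """postings: term -> sorted doc ids. One key per (segment, term),
    with the whole posting list packed into a single value."""
    for term, doc_ids in postings.items():
        deltas, prev = [], 0
        for d in doc_ids:                   # delta-encode to stay compact
            deltas.append(d - prev)
            prev = d
        kv[("seg", segment_id, term)] = struct.pack(f"{len(deltas)}I", *deltas)

def read_postings(segment_id, term):
    raw = kv.get(("seg", segment_id, term), b"")
    deltas = struct.unpack(f"{len(raw) // 4}I", raw)
    out, acc = [], 0
    for d in deltas:                        # undo the delta encoding
        acc += d
        out.append(acc)
    return out
```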
If anyone has any serious interest in trying this, my email is in my profile to discuss further.
Hmmm. I'm skeptical. A Lucene term lookup is stupidly fast. It traverses an FST, which is small and probably in memory. Traversing the postings lists itself also needs to be smart by following a skip table, which is critical for performance.
> you should be able to take Lucene's nice fast immutable data structure and stuff blocks of it (at the term level or below) into FDB values very efficiently.
That sounds a lot like Datomic's "Storage Resource" approach, too! Would Datomic-on-FDB make sense, or is there a duplication of effort there?
Datomic’s single-writer system requires conditional put (CAS) for index and (transaction) log (trees) roots pointers (mutable writes), and eventual consistency for all other writes (immutable writes) [0].
I would go as far as saying a FoundationDB-specific Datomic may be able to drop its single-writer system due to FoundationDB’s external consistency and causality guarantees [1], drop its 64-bit integer-based keys to take advantage of FoundationDB range reads [2], drop its memcached layer due to FoundationDB’s distributed caching [3], use FoundationDB watches for transactor messaging and the tx-report-queue function [4], use FoundationDB snapshot reads [5] for its immutable index tree nodes, and maybe more?
Datomic is a FoundationDB layer. It just doesn’t know yet.
I wrote the original version of Solandra (which is/was Solr on Cassandra) on top of Jake's Lucene on Cassandra[1].
I can confirm it wasn't fast!
(And to be fair that wasn't the point - back then there were no distributed versions of Solr available so the idea of this was to solve the reliability/failover issue).
I wouldn't use it on a production system nowadays.
I just watched the demo of 5 machines and 2 getting unplugged. The remaining 3 can form a quorum. What happens if it was 3 and 3? Do they both form quorums?
A subset of the processes in a FoundationDB cluster have the job of maintaining coordination state (via disk Paxos).
In any partition situation, if one of the partitions contains a majority of the coordinators then it will stay live, while minority partitions become unavailable.
Nitpick: To be fully live, a partition needs a majority of the coordinators and at least one replica of each piece of data (if you don't have any replicas of something unimportant, you might be able to get some work done, but if you have coordinators and nothing else in a partition you aren't going to be making progress)
The majority is always N/2 + 1, where N is the number of members. A 6-member cluster is less fault-tolerant than a 5-member cluster (quorum is 4 nodes instead of 3, and it still only tolerates 2 node failures).
The number of coordinators is separate from the number of boxes. You don't have to have a coordinator on every box.
I think you can set the number of coordinators to be even, but you never should - the fault tolerance will be strictly better if you decrease it by one.
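The arithmetic behind that, as a quick sketch:

```python
# Majority quorum is floor(N/2) + 1. An even-sized coordinator set
# tolerates no more failures than the odd set one smaller, so dropping
# from 6 to 5 loses nothing and shrinks the quorum.
def quorum(n):
    return n // 2 + 1

def tolerated_failures(n):
    return n - quorum(n)
```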
Apple also uses HBase for Siri I believe, what are some of the cluster sizes that FoundationDB scales to? Could it be used to replace HBase or Hadoop?
I was in attendance at your talk, and thought it was one of the best at the conference. Apple, I think, broke some hearts by going completely closed-source for a while, but I'm glad to see them open-sourcing a promising technology.
If scale is a function of read/writes, very large. In fact with relatively minimal (virtual) hardware it's not insane to see a cluster doing around 1M writes/second.
I was talking more about large file storage like HDFS, and the MapReduce model of bringing computation to data. HBase does the latter, and it's strongly consistent like FoundationDB, though FoundationDB provides better guarantees. As a K/V I understand what you and OP say.
How does this compare with CockroachDB? I'm planning to use CockroachDB for a project but would love to get an idea if I can get better results with FoundationDB.
They might be targeting the wrong market, hence the desperate marketing. For people who use MySQL/PostgreSQL a compatible, slower, but distributed database probably just doesn't solve any problem. Those people need a managed solution, not a distributed one.
That presentation was really good! The simulations were well explained. If one wanted to get in on this and build something with FoundationDB, but has no database experience (I do know many programming languages), where would one start? If anyone could point me in the right direction, I'd greatly appreciate it.
How scalable and feasible would it be to implement a SQL layer on top of SQLite's virtual table mechanism (https://www.sqlite.org/vtab.html) that redirects reads and writes of the record data to FoundationDB?
Long before we acquired Akiban, I prototyped a SQL layer using the (now defunct) sqlite4, which used a K/V store abstraction as its storage layer. I would guess that a virtual table implementation would be similar: easy to get working, and it would work, but the performance is never going to be amazing.
To get great performance for SQL on FoundationDB you really want an asynchronous execution engine that can take full advantage of the ability to hide latency by submitting multiple queries to the K/V store in parallel. For example if you are doing a nested loop join of tables A and B you will be reading rows of B randomly based on the foreign keys in A, but you want to be requesting hundreds or thousands of them simultaneously, not one by one.
Even our SQL Layer (derived from Akiban) only got this mostly right - its execution engine was not originally designed to be asynchronous and we modified it to do pipelining which can get a good amount of parallelism but still leaves something on the table especially in small fast queries.
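The latency-hiding point can be illustrated with a toy asyncio sketch; the simulated KV lookup and the table shapes are made up for illustration, not FDB's API:

```python
import asyncio

B_TABLE = {fk: f"b-row-{fk}" for fk in range(100)}   # inner join table

async def kv_get(key):
    await asyncio.sleep(0.01)        # pretend this is a network round trip
    return B_TABLE.get(key)

async def join(a_rows):
    # Issue all inner-table lookups at once: total latency is roughly one
    # round trip instead of one round trip per row of A.
    b_rows = await asyncio.gather(*(kv_get(a["fk"]) for a in a_rows))
    return [(a, b) for a, b in zip(a_rows, b_rows)]

a_rows = [{"id": i, "fk": i % 100} for i in range(1000)]
result = asyncio.run(join(a_rows))
```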
@voidmain, thank you, that's very insightful and clear! I can see the disadvantage of implementing such a SQL layer directly through SQLite's virtual tables.
Would it be possible to build a tree DB on top of it, like MonetDB/XQuery? I always wondered why XML databases never took off; I've never seen anything else quite as powerful. Document databases du jour seem comparatively lame.
Yes you can. You basically need a tree index. Any KV store can serve as the backing data structure. I've been writing one for bidirectional transformation of config files.
MongoDB is just a database with less features than a SQL database. An XML/XQuery database is fundamentally different, so I figured if FoundationDB layers are really so powerful, they might be able to model a tree DB as well.
In some distributed databases the client just connects to some machine in the cluster and tells it what it wants to do. You pay the extra latency as it redirects these requests where they should go.
In FDB's envisioned architecture, the "client" is usually a (stateless, higher layer) database node itself! So the client encompasses the first layer of distributed database technology, connects directly to services throughout the cluster, does reads directly (1xRTT happy path) from storage replicas, etc. It simulates read-your-writes ordering within a transaction, using a pretty complex data structure. It shares a lot of code with the rest of the database.
If you wanted, you could write a "FDB API service" over the client and connect to it with a thin client, reproducing the more conventional design (but you had better have a good async RPC system!)
> but you had better have a good async RPC system!
The microservices crew with their "our database is behind a REST/Thrift/gRPC/FizzBuzzWhatnot microservice" pattern is still catching up to the significance of this statement.
This might be a dumb question (from someone used to blocking JDBC), but why is async RPC important in this case? Just trying to understand. And can gRPC not provide good async RPC?
I was referring to the trend of splitting up applications into highly distributed collections of services without addressing the fact that every point where they communicate over the network is a potential point of pathological failure (from blocking to duplicate-delivery etc). This tendency replaces highly reliable network protocols (i.e. the one you use to talk to your RDBMS) with ad hoc and frequently technically shoddy communication patterns, with minimal consideration for how it might fail in complex, distributed ways. While not always done wrong, a lot of microservice-ification efforts are quite hubristic in this area, and suffer for it over the long term.
Wouldn't layers be hard to build on the server (since you'd also have to change the client) and slow if built as a separate service?
I'm not sure what you are asking, but depending on their individual performance and security needs layers are usually either (a) libraries embedded into their clients, (b) services colocated with their clients, (c) services running in a separate tier, or (d) services co-located with fdbservers. In any of these cases they use the FoundationDB client to communicate with FoundationDB.
In case (c) or (d) how can a layer leverage the distributed facilities that FDB gives?
I mean, if I have clients that connect to a "layer service" which in turn talks to FDB, I have to manage the layer service's scalability, fault tolerance, etc. by myself.
Yes, and that's the main advantage of choosing (a) or (b). But it's not quite as hard as it sounds; since all your state is safely in fdb you "just" have to worry about load balancing a stateless service.
Got it. What would you suggest for doing something like that? A simple RPC service with a good async framework, I've read.
Like what? An RPC service on top of Twisted for Python, and similar things in other languages?
Postgres operates great as a document store, btw. You don't really need Mongo at all. And if you need to distribute because you've outgrown what you can do on a single Postgres node, you don't want to use Mongo anyway.
If you’ve read any of the comments or have been following the project, it should be pretty obvious that this is far from a rookie effort.
This is a game changer, not a hobby project. This is the first distributed data store that offers enough safety, and a good enough understanding of CAP theorem trade-offs, that it can be safely used as a primary data store.
Or perhaps it's not so incredible? Maybe it wasn't such a huge hit for Apple and didn't live up to expectations, so they figure they can give it away and earn some community goodwill.
This is great news, when I was with dynamo, FoundationDB was the other green shore for me :). They did so many things so well.
A tiny bit of caution for folks trying to run systems like this, though: it is frigging hard at any reasonable scale. The whole thing might be documented / OSS and whatnot, but very soon you are going to run into problems deep enough that they require very core knowledge to debug and real energy to dive into, both of which you probably don't want to invest your time in. Do evaluate the cloud offerings / supported offerings before spinning these up, or else ensure you have hired experts who can keep this going. They are great as a learning tool, pretty hard as an enterprise solution. I have seen the same issue a ton of times with a bunch of software (Redis/Kafka/Cassandra/Mongo...) by now. IMO In the stateful world, operating/running the damn thing is 85% of the work, 15% being active dev. (Stateless world is a little better, but still painful).
> IMO In the stateful world, operating/running the damn thing is 85% of the work, 15% being active dev. (Stateless world is a little better, but still painful).
The number of newbie engineers who see docker/kubernetes and think they can "docker run" or "helm install" a stateful service in a couple of minutes is mind-boggling.
I'll remember this quote when trying to talk sense to them.
I went to the same high school as the founders[1]. They were about the 2 best software engineers in a school with a LOT of very smart software engineers. Another pair founded Yext, which went public last year. I still consider that school the group with the highest concentration of raw brain power I've ever been a part of.
I'm probably a 1% engineer, been hired by M$, FB, and Google. These guys were light years ahead of me. I'm not sure I'm as good now as they were at like 17 years old. In fact I'm probably only a decent engineer from having observed the stuff they were doing back then and finding inspiration.
I went to UVA and I think about half of the engineering school was from TJ (including 3 of my roommates). :) I can't think of any superior public high school (and I went to a different Governor's School in Virginia myself) anywhere, due to the amazingly large and deep talent pool TJ pulls from. Nothing like it exists in the Bay Area, that's for sure!
Just checked the rankings. My first high school made #172 on the list ("International School of the Americas").
My second high school is unranked ("Texas Academy of Math and Science"). I don't think it qualifies as a high school. Seems we haven't done a great job of identifying the accomplishments of our alumni, based on the Wikipedia page. No doubt it would rank near the top though. My year alone Caltech accepted about 30 of us, more than any other high school in the country. Makes me wonder what my peers have been up to.
Anyway, I'd agree that these tech high schools have some amazingly smart people attending them.
Forgot to mention that those four engineers were same school AND same grade as me. So 4 out of 400 students were the tech guys behind Yext and FoundationDB.
Maybe it's just me, but the hubris in this comment seems a little excessive.
The aforementioned SWEs made themselves multimillionaires /and/ have great jobs /and/ get praised by their peers, yet now everyone else has to bow down to them... over claims that they were great programmers in high school? Is an appeal to your own experience the best way to make yourself seem relevant here?
It's an incredible DB system, but it ain't the Second Coming. Calm down.
I don't bow down to my laptop, but I respect what went into making it.
It's great to admire excellent engineers; aspiring to be as skilled as someone at a task can be very motivating. Worshipping them is another thing.
You're right about the inferiority complex - I know I'm a relatively bad SW engineer, but that's mostly related to how new it is to me. I expect and want to improve.
Hubris is defined as excessive pride or self-confidence. I would say that bragging that you're "probably a 1% engineer," and that you've been hired by three of the largest SW companies out there, qualifies as hubris. Maybe it's just me, but I don't think a 1% engineer would publicly boast about being one and then actually use 'M$' in a non-farcical manner.
Shitting on 99% of the SWE population to make yourself look good, then shitting on yourself to make another person look even better doesn't really work. There's a reason humanCmp() is a little more complex than strCmp().
BTW, being hired by a large company doesn't mean you're all that. Plenty of idiots get hired by Oracle.
Speaking as the original author of this monstrosity of a build system, please be careful before offering praise here. To be clear, there is a top-level, non-recursive Makefile that uses the second expansion feature of GNU make, translating Visual Studio project files into generated Makefile inputs that are transformed into targets to power the build.
Although it starts by running `make`, it's about as in-house as a thing can be.
Does the horse choke on the baseball? Is there an equine version of the Heimlich maneuver to be performed on horses suffering from mixed-metaphorical-adage-induced asphyxiation?
And then there were the hijinks we went through to build a cross-compiler with modern gcc and ancient libc (plus the steps to make sure no dependency on later glibc symbols snuck in):
I saw this in the Makefiles, and -- ah, the life of distributing proprietary Linux software. At one point in a prior job, we just replaced our build system with a wrapper Makefile that 'chroot'ed into a filesystem image that was a snapshot of one of our build machines, since it was so difficult to set up. This meant we (developers) had easier system updates, security upgrades, etc. That was just the tip of the iceberg!
Now you can "... be beautiful and terrible as the Morning and the Night! Fair as the Sea and the Sun and the Snow upon the Mountain! Dreadful as the Storm and the Lightning! Stronger than the foundations of the earth. All shall love [you] and despair!" with a build inside docker; the host can run a modern kernel; your image can run Slackware 0.1 ;-)
I'm curious what you think is wrong with gn/ninja (besides the fact that it's non-standard). My problems building Chromium come mostly from other parts of depot_tools.
We (Wavefront) have been operating petabyte-scale clusters for the last 5 years with FoundationDB (we got the source code via escrow), and we are super excited to be involved in the open-sourcing of FDB. We have operated over 50 clusters on all kinds of AWS instances, and I can talk about all the amazing things we have done with it.
We basically replaced mySQL, Zookeeper and HBase with a single KV store that supports transactions, watches, and scales. It's not a trivial point that you can just develop code against a single API (finally Java 8 CompletableFutures) and not have to set up a ton of dependencies when you are building on top of FDB. We are (obviously) experts at monitoring FoundationDB with Wavefront and we hope to release the metric harvesting libraries and template dashboards that we use to do so.
Almost 5 years in and we have not lost any data (but we have lost machines, connectivity, seen kernel panics, EBS failures, SSD failures, etc., your usual day in AWS =p).
"but we have lost machines, connectivity, seen kernel panics, EBS failures, SSD failures, etc., your usual day in AWS" <=== This. I wish more people realized that this is a day-to-day reality if you are in AWS at scale.
Basically, once a system is complex enough, some part of it is always broken. The software must be designed from the assumption that the system is never running flawlessly.
I seem to think that cloud providers are particularly opaque about small glitches (i.e. they aren't going to tell you that a router or switch was rebooted for maintenance if it comes back right away, and if you email support it's always the same response: "it's working right now") :)
I only have experience with AWS, on-prem, and high-quality colo like Equinix. Possibly due to reduced complexity and having full control over the networking setup, we saw significantly fewer issues on the latter vs. AWS.
Already playing with it. FoundationDB is used for production petabyte-scale deployments, and the whole deterministic-simulation approach to testing is really reassuring as far as bugs/stability go. I'm guessing that with Apple's resources that approach was taken to a whole new level after the acquisition?
I can see everyone's extremely happy about this, which is great. As someone who's never used it, I'd like to know more about FoundationDB and how it compares to other offerings such as MySQL or Postgres, and which use cases is it most suited to. I would especially love to hear the thoughts of those with direct experience of using Foundation DB. Thanks!
Seconding this. Instead of hearing from bloggers or contributors, what is using this DB like in the trenches? What simple use case would fit this DB perfectly? Is there a lot of setup on second/third deploy? Easy to maintain, or requires a lot of tweaking/tuning? How is memory usage with 10M records? 100M?
It's under Apache 2.0 for those curious: https://github.com/apple/foundationdb/blob/master/LICENSE. Also, side note: it looks like this was a private GitHub repository for at least a couple months, since they have pull requests going back for at least that long. I find this interesting, since Apple normally "cleans up" history before open sourcing.
I hadn't heard of FoundationDB before, so I did some digging into the features: https://apple.github.io/foundationdb/features.html . It seems to claim ACID transactions with serializable isolation, but also says later on that it uses MVCC, slower clients won't slow down operations, and that it allows true interactive queries. I didn't think an MVCC implementation could provide that level of isolation, and I'm not even sure how you provide that level of isolation and those other guarantees with any implementation, am I misunderstanding something?
I'll try to give you a quick introduction. The architecture talk I recorded for new engineers working on the product ran to four or five hours, I think :-). In short, it is serializable optimistic MVCC concurrency.
A FDB transaction roughly works like this, from the client's perspective:
1. Ask the distributed database for an appropriate (externally consistent) read version for the transaction
2. Do reads from a consistent MVCC snapshot at that read version. No matter what other activity is happening you see an unchanging snapshot of the database. Keep track of what (ranges of) data you have read
3. Keep track of the writes you would like to do locally.
4. If you read something that you have written in the same transaction, use the write to satisfy the read, providing the illusion of ordering within the transaction
5. When and if you decide to commit the transaction, send the read version, the list of ranges read, and the writes you would like to do to the distributed database.
6. The distributed database assigns a write version to the transaction and determines if, between the read and write versions, any other transaction wrote anything that this transaction read. If so there is a conflict and this transaction is aborted (the writes are simply not performed). If not then all the writes happen atomically.
7. When the transaction is sufficiently durable the database tells the client and the client can consider the transaction committed (from an external consistency standpoint)
The implementations of 1 and 6 are not trivial, of course :-)
So a sufficiently "slow client" doing a read write transaction in a database with lots of contention might wind up retrying its own transaction indefinitely, but it can't stop other readers or writers from making progress.
It's still the case that if you want great performance overall you want to minimize conflicts between transactions!
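As a rough illustration, here is a toy, single-process Python model of the optimistic protocol above (all names are made up for illustration; the real system distributes steps 1 and 6 across many machines and is vastly more sophisticated):

```python
# Toy single-process model of FDB-style optimistic MVCC commits.
# Purely illustrative: the real implementation is distributed.

class VersionedStore:
    def __init__(self):
        self.version = 0
        self.history = {}  # key -> list of (version, value)

    def read_version(self):                # step 1: hand out a read version
        return self.version

    def read(self, key, at_version):       # step 2: consistent snapshot read
        for v, val in reversed(self.history.get(key, [])):
            if v <= at_version:
                return val
        return None

    def commit(self, read_version, reads, writes):   # steps 5-6
        # Conflict check: did any committed transaction write a key
        # we read, after our read version?
        for key in reads:
            for v, _ in self.history.get(key, []):
                if v > read_version:
                    return False           # conflict -> abort
        self.version += 1                  # assign the write version
        for key, val in writes.items():
            self.history.setdefault(key, []).append((self.version, val))
        return True                        # all writes applied atomically

class Transaction:                         # client-side state (steps 2-4)
    def __init__(self, store):
        self.store = store
        self.rv = store.read_version()
        self.reads = set()                 # keys read (ranges in real FDB)
        self.writes = {}                   # writes buffered locally

    def get(self, key):
        if key in self.writes:             # step 4: read your own writes
            return self.writes[key]
        self.reads.add(key)
        return self.store.read(key, self.rv)

    def set(self, key, val):               # step 3: buffer the write
        self.writes[key] = val

    def commit(self):
        return self.store.commit(self.rv, self.reads, self.writes)
```

With two concurrent transactions touching the same key, the later committer sees the conflict and aborts, while the other proceeds, mirroring the "retry on conflict" behavior described above.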
This is a good explanation of how it happens on a single node. What do you do when the transaction is distributed? How do you achieve consensus? Is there a write up on it anywhere?
The only thing that's different in a distributed cluster is the implementations of steps 1 and 6. As voidmain said, the details of that are not trivial, ESPECIALLY the details of how it never produces wrong answers during fault conditions.
I don't know that there's been an exhaustive writeup of that part, but maybe one of us or somebody on the Apple team will put something together. It probably won't fit in an HN comment though!
Or... maybe this is the part where I point out that the product is now open-source, and invite you to read the (mostly very well commented) code. :-)
Going forward as an opensource product, I hope to see some clarity on the "how it works"... Distributed, performant ACID sounds good, almost too good to be true. Not that I doubt it at the moment, I just want to understand it better :)
Thanks. Can you elaborate on how 6 is actually accomplished? Various earlier comments have hinted that the transactional authority (conflict checking) can actually scale 'horizontally' beyond the check throughput that can be achieved by a single node. Is that the case? And what's the magic sauce for doing that for multi-object transactions? :)
Yes, conflict resolution is for most workloads a pretty small fraction of total resource use so you usually don't need a ton of resolvers (I think out of the box it still comes configured with just one?), but it can scale conflict resolution horizontally.
The basic approach isn't super hard to understand, though the details are tricky. The resolvers partition the keyspace; a write ordering is imposed on transactions and then the conflict ranges of each transaction are divided among the resolvers; each resolver returns whether each transaction conflicts and transactions are aborted if there are any conflicts.
(In general the resolution is sound, but not exact - it is possible for a transaction C to be aborted because it conflicts with another transaction B, but transaction B is also aborted because it conflicts with A (on another resolver), so C "could have" been committed. When Alec Grieser was an intern at FoundationDB he did some simulations showing that in horrible worst cases this inaccuracy could significantly hurt performance. But in practice I don't think there have been a lot of complaints about it.)
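The partitioned scheme, including the inaccuracy described above, can be sketched in a few lines of Python (a toy model with made-up names, not the real resolver code):

```python
# Toy model of range-partitioned conflict resolution. Each resolver owns
# part of the keyspace and checks, in commit order, whether a transaction's
# reads conflict with an earlier transaction's writes in its range --
# WITHOUT knowing whether that earlier transaction was aborted elsewhere.

def resolve(batch, partitions):
    # batch: list of (name, reads, writes) in imposed commit order
    # partitions: one key-ownership predicate per resolver
    verdicts = {name: True for name, _, _ in batch}
    for owns in partitions:
        written = set()                  # writes this resolver has accepted
        for name, reads, writes in batch:
            if any(owns(k) and k in written for k in reads):
                verdicts[name] = False   # conflict in this resolver's range
                continue                 # locally-aborted writes don't count
            written |= {k for k in writes if owns(k)}
    return verdicts
```

Running it on the A/B/C chain from the parenthetical above shows the false abort: resolver 1 aborts B (it read "a", which A wrote), but resolver 2 doesn't know that, still counts B's write to "n", and therefore aborts C, even though C "could have" been committed.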
Of course. FDB thinks about read and write conflict ranges, which are functions of the keys, not the values (or lack thereof). A read of a non-existent key conflicts with a write to that key. A read of a range of keys conflicts with a write to a key in that range, even if that key did not have an associated value at the time of the original read.
> I didn't think an MVCC implementation could provide that level of isolation
MVCC needs a bit of additional logic on top to be serializable; "serializable snapshot isolation" is a good keyword to search for. But it's definitely possible.
A lot of people apparently hated that I attacked this person, and downvoted me; but what the person who posted this comment was doing was intentionally throwing shade at and casting doubt on a project by saying "I think that the claims on this webpage don't make any sense, as this should be impossible". It comes off as "oh come on, this doesn't even sound plausible; I call bullshit".
This then causes people who aren't versed with the product or the technology to decrease their perception of the product, and puts the team behind it in a position of having to not just come to its defense but to do so quickly due to the perception concerns.
Meanwhile, if they just did a basic search for "MVCC serializable" they would find that they were wrong; which means that it took more time to leave this insulting comment than it would have taken them to learn how this can work.
As a community, we really, really need to beat down on casual cynics like this, who like to lazily "call bullshit" or play the "citation needed" card as a way to undermine the credibility of other peoples' products. We live in a future where the answers to these kinds of doubts are a moment away: this particular form of debate tactic needs to die.
I absolutely was not throwing shade at the project, I was just trying to understand. I have a great interest in distributed systems that can efficiently implement serializable ACID transactions, and looked at their documentation with great interest to see how they implemented it. I honestly thought that MVCC could only efficiently implement Snapshot Isolation. So I figured my understanding was wrong, or at worst there's a typo in their documentation and they use something other than MVCC under the hood. Either way, I wanted to know more about their architecture.
I'll take your feedback under consideration, and I'm sorry to have irritated you.
I'm very interested in hearing more about what running FoundationDB in production is like.
I believe that FoundationDB stores rows in lexicographical order by key. Other databases like Cassandra strongly push you toward not storing data this way as it can easily lead to hotspots in the cluster. How do you deploy a FoundationDB cluster without leading to hotspots, or perhaps what operational actions are available to rebalance data?
Cassandra allows you to configure a "partitioner" that determines which nodes a primary key belongs to. There is a ByteOrderedPartitioner that stores partitions in order lexicographically by primary key. There is also a Murmur3 hash-based partitioner (which is the recommended default).
Cassandra allows you to store multiple records in sorted order within a partition. The normal recommended way to get data locality is to store records that are frequently accessed together in the same partition.
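FDB itself shards key ranges and rebalances them automatically, but a layer can also shape its keys to spread a hot sequential workload. One common trick (purely a layer-level key design with an assumed bucket count, not an FDB feature) is a small hash-bucket prefix per series:

```python
# Sketch: spreading monotonically increasing keys (e.g. timestamps)
# across N buckets so new writes don't all land on one shard, while
# keeping each series' data contiguous and range-readable.

import hashlib

NUM_BUCKETS = 8  # assumption; tune to cluster size

def bucketed_key(series_id, timestamp):
    # Stable bucket per series: all of one series' points stay adjacent
    # within the bucket, so per-series range reads still work in order.
    bucket = int(hashlib.md5(series_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    return (bucket, series_id, timestamp)

def series_range(series_id):
    # The key range to scan for one series; the bucket is derivable
    # from the series id, so no fan-out across all buckets is needed.
    bucket = int(hashlib.md5(series_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    return (bucket, series_id, 0), (bucket, series_id, float("inf"))
```

Different series hash to different buckets, so write load spreads across the keyspace even though timestamps within each series are strictly increasing.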
I am not a database expert by any means but have been curious about distributed data systems and had not heard of FoundationDB till now and was very excited to read about it.
On reading through the documentation, I encountered a section on "Known Limitations"[1] which stated that keys could not be larger than 10kb and values cannot be larger than 100kb. This seems to be a major limitation. Am I missing something or is this strictly for storing text?
Because the data model is ordered, large blobs can and normally should be mapped to a bunch of adjacent keys and read with a range read, not a single huge value. That also allows you to read or write just part of one efficiently.
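A minimal sketch of that chunking pattern, with a plain dict standing in for the ordered key/value store (the chunk size and key layout here are assumptions, not an FDB convention):

```python
# Sketch: storing a large blob as adjacent keys so it can be read back
# with a single range read and written/updated in parts.

CHUNK = 100_000  # stay under the ~100 kB value limit

def write_blob(kv, name, data):
    for i in range(0, max(len(data), 1), CHUNK):
        # Fixed-width chunk index keeps the keys lexicographically ordered.
        kv[(name, "%010d" % (i // CHUNK))] = data[i:i + CHUNK]

def read_blob(kv, name):
    # Emulates a range read over the blob's key prefix.
    chunks = sorted(k for k in kv if k[0] == name)
    return b"".join(kv[k] for k in chunks)
```

Updating one chunk touches only that key, which is how partial reads and writes of a large value stay efficient.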
Worked a job once where that was the underlying data store.
I was only allowed to touch the SQL interface to it, which was....weird.
The SQL dialect was ancient, felt like something from about 1990 (and this was in.... 2012 or so, so not THAT long ago).
Query performance seemed invariant. A simple select * from foo where id=X and a monster programmatically generated join across 15 tables would both take about 1.5 seconds to return results.
Yeah, when I read about that I thought it sounded neat—make sure the index updates with the data. On the other hand, I think of CAP theorem as an iron triangle. If you are gaining consistency, what’s the trade off?
This _is_ the usual trade off, but what makes FoundationDB so crazy is that it's a CP system that has a performance profile that AP systems would have a hard time matching.
Wow, this is some really exciting news! I think it would be amazing to create a GraphQL API for FoundationDB, so I have created a feature request for this in the Prisma repo. For those who don't know, Prisma is a GraphQL database mapper for various databases.
How's this compare to https://github.com/pingcap/tikv? It's a relatively new distributed KV store written in Rust that also is transactional and backs the new TiDB database.
Looks promising, but does anyone know if FoundationDB has external events or triggers, similar to Firebase or RethinkDB? I can't seem to find much on it.
If not, then a lot of potential is being left on the table, because usage would require wrapping FoundationDB in a proxy or middleware of some kind to synthesize events, which can be extremely difficult to get right (due to race conditions, atomicity issues, etc). Without events, apps find themselves polling or rolling their own pub/sub metaphor over and over again. If anyone with sway is reading this, events are very high on the priority list for me. Thanks!
In addition to single-key asynchronous watches, there are also versionstamped ops (for maintaining your own, sophisticated log in a layer) and configurable key range transaction logging (but see the caveats in my other post on the topic).
I'm not sure it has every feature it will ever need in this area, but it's a pretty good starting point for building "reactive" stuff.
Your response is both very exciting and slightly intimidating! Would love to see a “NewSQL” and/or event store built on top of FDB tech. Would the key-range logging and versionstamped ops be capable of providing/emulating a performant “atomic broadcast” similar to that in Kafka?
Versionstamped operations and transaction logging are fully transactional. Watches are asynchronous: they are used to optimize a polling loop that would "work" without them.
That's from sqlite. (Awesome tech, awesome license).
Related snippet from the "Distinctive Features Of SQLite" page[1] from the sqlite project:
The source code files for other SQL database engines typically begin with a comment describing your legal rights to view and copy that file. The SQLite source code contains no license since it is not governed by copyright. Instead of a license, the SQLite source code offers a blessing:
May you do good and not evil
May you find forgiveness for yourself and forgive others
May you share freely, never taking more than you give.
The storage engine is, and always was, a fairly heavily modified asynchronous version of sqlite's btree. It's been extremely reliable, which was always our top priority, and the performance isn't bad. But honestly, when there was a problem with it, our development velocity in improving it wasn't great.
It's super easily pluggable[1], so now that it is open source people can experiment with other engines. I think there is a lot of room for improvement. Also architecturally it's designed in anticipation of being able to run different storage engines for different key ranges and for different replicas. For example, you might keep one replica in a btree on SSD (for random reads) and two on spinning disks in a log structured engine.
It looks to me like Apple has made a pretty complete release of the key/value store. What's missing is
(1) Layers! Everything from relational databases to full text search engines to message queues
(2) Monitoring stuff. Unsurprisingly it doesn't look like we have the tools for monitoring log files, etc. Wavefront (also a major user!) is a great commercial solution, but there should be something OSS
Truthfully, at Wavefront we've taken the json status directly into telegraf. Plus a bunch of python tooling to massage additional telemetry on a cluster's health (coordinator reachability, for example).
Plus even more tooling (mostly Ansible) for managing large fleets.
How does this stack up against HBase and Cassandra, which seem to have gotten traction already in the same areas that FoundationDB seems best suited for?
HBase and Cassandra both provide much weaker guarantees. At best they can support compare-and-set operations that are local to a single region/"row", whereas FoundationDB lets you do optimistic transactions (and consistent read snapshots) on the entire database. And at least in cassandra even getting that level of consistency comes at a big performance cost.
The benefit of ACID transactions isn't just safety at the application level, it's the ability to compose abstractions and build complex data models on top of simple ones. For example, indexes in FoundationDB are typically more scalable than in the majority of other systems, where index queries often have to be broadcast to all the systems in a cluster. Yet FoundationDB doesn't even have indexes as a feature - higher layers build and maintain index invariants using transactions.
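For instance, a layer can keep a secondary index consistent simply by writing the record and its index entry in the same transaction. A toy sketch, with a dict buffer standing in for a real FDB transaction (all names here are hypothetical):

```python
# Toy sketch of layer-maintained indexing: the record write and the
# index write go into one transaction buffer, so after an atomic
# commit the index invariant always holds.

def set_user(tr, user_id, email, old_email=None):
    # tr is a dict-like transaction buffer; commit is all-or-nothing.
    tr[("user", user_id)] = email
    if old_email is not None:
        tr[("email_index", old_email)] = None   # tombstone the old entry
    tr[("email_index", email)] = user_id        # point index at the record

def commit(store, tr):
    # Stand-in for an atomic FDB commit: apply all buffered writes.
    for k, v in tr.items():
        if v is None:
            store.pop(k, None)                  # a tombstone clears the key
        else:
            store[k] = v

def lookup_by_email(store, email):
    return store.get(("email_index", email))
```

Because both writes commit (or abort) together, a reader can never observe a record without its index entry or a stale index entry pointing at a changed record.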
FoundationDB is also just really reliable and fault tolerant. Its testing story is drastically better than what the teams building these other products are doing.
An interesting fact is that Apple is probably one of the biggest users, if not the biggest user, of Cassandra out there.
Can't speak to HBase, but one thing Cassandra doesn't guarantee is ACID - I've seen some data consistency issues that have arisen from Cassandra in our usage, although it hasn't been a huge problem for us. That difference alone probably brings a lot of value to FoundationDB.
My understanding of hbase (which could be wrong) is that writes of a particular key always go to the region master first, so if you read from the master you always get the latest value of a key. The tricky part is that when the region master goes offline another one needs to take its place, and you can get inconsistency or unavailability depending on how it is set up.
I’d like to see a deep dive of how FoundationDB handles this. It has to trade off consistency for availability at some point, and it would be nice to know exactly where.
By default in HBase, each region is served by a single region server. Writes and reads always go to that regionserver, and are consistent across a single row (HBase is a CP system). In this mode, when a regionserver goes offline its regions are (eventually) reassigned to a different region server. During this process the region is unavailable, but there's no loss of consistency.
Hbase 0.98 (I think?) introduced a feature called "timeline consistency" that allows reads from replica regionservers. This has to be enabled for a table and has to be specified on the query side. If it is, you have the option of falling back to the replica if the primary doesn't respond within a deadline. This may be a good tradeoff if you value availability over consistency.
The core product is a distributed, highly fault tolerant ordered Key Value store with true serializable ACID transactions. All of the layers (including document, graph) sit on top of that and inherit its ACID properties, scalability, fault tolerance, etc. It doesn't appear to me that they released any of the top level layers, but those are MUCH simpler to build, and that's where the OS community can step in.
FoundationDB at its core is not a graph database. You could build a graph database on top of it, using FoundationDB as a very strong and feature rich storage engine, however you'd like. It would be much simpler to do than building a new (especially a distributed) graph database from scratch.
You can think of FDB as a distributed storage engine. It has the same low level data model as the engines you mention, but has distributed transactions, fault tolerance, automatic data partitioning, operational tooling, etc built in. So if you build an "application level database" or library on top of it, it is automatically a distributed database.
Yes, but with an important caveat: Getting ACID transactions to work correctly across multiple machines is the really hard part. FDB is providing a KV store (across multiple machines) kind of like RocksDB does (on one machine), but it's also solving that really hard problem for you. That puts it in kind of a different league.
Well, it's infrastructure. Moreover, it's infrastructure for infrastructure! So if that sounds super boring, you don't have to be excited :-)
But I would tell the story something like this: state storage is the root of (almost) all operational evil. It's very easy to make a system reliable if it's totally stateless. Even most bugs can be lived with if the worst you have to do is restart a service and carry on! But to do anything interesting you have to store state somewhere, and you have to modify that state concurrently without screwing it up.
And the many challenges of operating stateful systems are greatly multiplied if you have a lot of different ones. For example, if you have a datacenter outage and some but not all of your stateful systems deal with it correctly, probably your application as a whole is still down.
So as one more stateful system, does FoundationDB just make that worse? Well, FoundationDB is designed specifically to be a foundation for many very different stateful systems - not just different kinds of databases but things like search engines or message queues that you normally don't think of in the same category. So that almost any system can map to it efficiently, it has a lowest common denominator data model (key/value) and the highest possible guarantees in terms of concurrency control. And you can run diverse systems supporting an application on the same FoundationDB cluster, or on different clusters with the same exact operational requirements.
A few users of FoundationDB have been able to get the benefits of this vision, consolidating lots of different stuff into a single, operationally desirable system. But for more people to be able to, not only does the key/value store have to be available to them, but lots of stuff also has to be built on top of it. By releasing FoundationDB under a very liberal open source license, Apple has hopefully made that possible. In the long run, hopefully it will make all server-side computing more reliable.
Also, it's a really good key/value store, if you happen to need one of those!
I noticed that all the write benchmarks in https://apple.github.io/foundationdb/benchmarking.html are for random writes. Is write throughput affected by highly-sequential writes (e.g. - time series) vs random writes? How do you avoid hot-spotting on recent ranges?
How efficient are range deletes?
On https://apple.github.io/foundationdb/performance.html I read "The memory engine is optimized for datasets that entirely fit in memory, with secondary storage used for durable writes but not reads." I'd like some clarification:
(1) Which memory does "entirely fit in memory" refer to? A single machine? Or SingleNodeMemory * Nodes / ReplicationFactor?
(2) If only recently-written data is likely to be queried, and all recently-written data fits entirely in memory, is that sufficient? If so, would an unexpected query of old data cause a huge impact on write throughput?
(3) What is the structure/format of the data stored on disk? How is it updated?
I'm wondering how well this could be used for time series data. I saw mention here that wavefront uses FoundationDB for this, but would like more details if any are available.
Sequential writes should be a little faster than random at the individual storage node level, but if your entire write workload is a single ordered log scalability will suffer. It might be theoretically possible for fdb to scale in this situation by creating shards on the fly during transaction processing, but no one has seriously tried to make that work.
You can mitigate by designing your key structure/data ordering to not have that property.
The memory engine requires your data to fit in memory (total across all your nodes, after replication). It writes interleaved snapshots and updates to disk, and reads the whole dataset back into memory when restarted.
You can do great modeling of time series data in FDB, though it will take some care and thought.
You should ask these questions on the forum. This article is falling off HN, I am going to lose track of it, and it doesn't look like the Apple team is answering questions here.
This is wonderful news. I built a proof-of-concept realtime collaborative editor on top of foundationdb a few years ago, and was very disappointed when I couldn't use it in production. I'm really excited to use this in some projects I'm working on.
Quick question: I know there's a watch API, but is there any way to subscribe to a change feed from foundationdb? I'd like to consume the FDB event log to do external indexing & map-reduce work.
Yes. I don't know how well documented it is, but there is an API (well, system keyspace) that can configure the database to log transactions for a selected key range (up to and including the whole database) into another selected key range. It is used by backup and asynchronous replication tools. The format of the configuration keys and transaction logs should be considered less stable than the core key/value API, which basically never breaks backward compatibility, so by using it you are shouldering a maintenance burden to keep up with e.g. changes in the log format and new types of mutations when new versions of FoundationDB come along. So it shouldn't be used too casually. But it's there.
Alternatively, your application or layer can use the "versionstamp" atomic operations to write its own ordered log of what it is doing, or other indexing tricks. Depending on your data model this might be able to be much more efficient. For example, for external indexing you probably don't need to preserve a history of prior values but only be able to identify all the values that have changed. This can be done with a very simple and compact index that doesn't need to duplicate all the data to be indexed.
> Alternatively, your application or layer can use the "versionstamp" atomic operations to write its own ordered log of what it is doing, or other indexing tricks.
I'm not sure I understand.
Are you suggesting having a second key space at `ops/{VERSIONSTAMP}` or something where values contain enough information about the operation to be able to process changes in an indexer? The indexer could then clean up after itself, deleting the operations once they had been ingested? ... Effectively using a portion of the keyspace as a queue?
Yes. Or if it's not important for the indexer to process things chronologically, you could just have an index of the primary key (only) of records that haven't been indexed.
If you are trying to make your external index MVCC, then you will want to carry some version information too.
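The compact changed-keys index described above can be sketched concretely. This is a simulation, not real FDB code: a plain dict stands in for a transaction's key/value view, and the subspace names (`records`, `dirty`) are invented for illustration. A real layer would run each method inside `@fdb.transactional` over tuple-encoded subspaces so the record write and the dirty marker commit atomically.

```python
class Store:
    """Toy sketch of a 'dirty keys' index: writers mark changed primary
    keys, the indexer drains the markers. A dict stands in for an FDB
    transaction's view of the keyspace."""

    def __init__(self):
        self.kv = {}

    def write_record(self, pk, value):
        # Write the record and, in the same (simulated) transaction,
        # mark it dirty. Presence-only marker: no data is duplicated.
        self.kv[("records", pk)] = value
        self.kv[("dirty", pk)] = b""

    def drain_dirty(self):
        # The indexer scans the dirty subspace, reads each changed
        # record, and clears the marker. Records touched again later
        # simply get re-marked.
        dirty = sorted(k for k in list(self.kv) if k[0] == "dirty")
        changed = []
        for k in dirty:
            pk = k[1]
            changed.append((pk, self.kv[("records", pk)]))
            del self.kv[k]
        return changed
```

Note that two writes to the same key before the indexer runs leave only one marker, which is exactly why this stays compact compared to logging every mutation.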
This kind of question might be better served by the new community forum you can get to from the website!
It's closest to TiDB's key-value layer; a building block for more complex systems. More traditional, monolithic databases like CockroachDB (SQL) or FaunaDB (NoSQL) trade off extensibility for the benefits in performance and operations that come from very tight coupling.
In my understanding, FoundationDB's transaction management is closest to FaunaDB's; read/write sets are linearized in memory in preprocessing nodes and distributed asynchronously to the replicas rather than locked on the replica leaders like Spanner or CockroachDB. This is why FoundationDB doesn't support long-lived transactions.
It's interesting that the FoundationDB team chose to unwind their service architecture (there used to be separate transaction manager and replica processes), I assume in the interests of ease of operations.
It is not clear to me how leader election and failover works for the transaction management role. Maybe somebody from the team can clarify.
No, it's always been possible to have as little as one fdbserver process and have a complete key/value store. Internally it is "microservices" though - it will start a "proxy", a "resolver", a "log", a "storage", etc within that one process.
I mean it in terms of monolithic process vs. service-oriented architecture, distinct from a distributed vs. centralized operational topology.
FaunaDB and CockroachDB are implemented as monolithic processes and can break encapsulation boundaries for performance reasons. For example, FaunaDB does aggressive predicate pushdown to accelerate intersections and joins, which you cannot do if you have to conform to a key/value interface exclusively. It can also eliminate all network overhead for query data that's local to the processing node.
I understand how that terminology is confusing though...how would you explain it?
I think the use of monolithic here is that CockroachDB is higher level than FoundationDB. The project was not designed to be your do-everything-DB that you layer different systems on, but has a specific goal and is intended to be used directly.
FoundationDB was a company that existed in the market, licensing our database technology for quite a while. Licenses don't necessarily terminate upon acquisitions.
I had always assumed they acquired the team more than the ip. I'm not sure this confirms that or not. I'd be curious what the answer to this question is as well, but wouldn't be surprised if the answer was "they don't"
Question for the FoundationDB gurus that are hanging out on this thread: How well does it deal with spotty connectivity? I'm asking because I work on mobile robots, and WiFi and/or LTE connections are always coming and going in unpredictable ways as the vehicle moves about its environment. Reconnecting every few minutes is normal.
You should have seen their trade show demos, where they'd demonstrate fault tolerance by literally powering down random computers and unplugging network cables, while visualizing how the system managed fault tolerance.
The fault tolerance is pretty much flawless. You won't be able to get the database "stuck" or see anomalies.
But performance is going to suck if you run server nodes over unreliable connections. I have trouble seeing a FoundationDB cluster running on mobile robots as more than a trade show gimmick. Albeit an awesome gimmick. So in summary you should totally do that.
I can see that. There are various vectors to performance. I hear you saying transactions per second would be unimpressive.
A couple of less time-sensitive applications are: 1. distributing information the entire fleet should eventually know, 2. event log aggregation with fine-grained time alignment among nodes.
Both are probably silly problems to solve with a database, killing houseflies with sledgehammers and all that, but it never hurts to explore creative tool misuse :)
Spanner (and to an extent its less mature OSS descendants Cockroach and TiKV) has more comparable goals, but is fairly different architecturally. For example, FoundationDB only requires N+1 replicas instead of 2N+1 to achieve N failure tolerance (even lots of databases with much weaker guarantees are in the latter category!), doesn't trust clocks at all, doesn't lose performance when transactions cross replica sets, and uses optimistic instead of pessimistic concurrency.
Also FoundationDB (and TiKV) make a distributed, transactional key/value store available as an API, while Spanner and Cockroach expose only a relational database layer. FoundationDB is designed philosophically with the idea that you want to have a single storage layer to manage operationally but should be able to mix and match data models and query engines above that layer.
On the other hand, FoundationDB doesn't currently have any full fledged high level database layer available. Someone will probably dig up our SQL layer (which was AGPL, I think) but I wouldn't really recommend using it in production because there is no active development team. Someone will probably try porting the SQL layers from TiDB and Cockroach.
Maybe Apple will open source more stuff in the future, but let's not get too greedy!
The datacenter-aware mode documentation [0] says “Although data will always be triple replicated in this mode, it may not be replicated across all datacenters.”
I think it's just saying that it's willing to place two of the three replicas in a datacenter, for example if one of the three datacenters is down. This has downsides, since losing a datacenter will make it aggressively fill up disks, but mitigates against subsequent failures causing data loss.
Most of the people who have run FoundationDB at scale have, for performance reasons, used configurations other than the "datacenter aware" mode for their inter-region replication, so that mode may not be the most operationally battle-tested.
There is some work that from what I can see in the code is still in progress to build a new, almost magical inter-region replication mode that I am very excited about, which combines synchronous replication to a "satellite" datacenter within region with asynchronous replication between regions and recovery logic that will finish replication and fail over in case of a partial failure of a region. You get fast transaction commits (much less than the inter region ping time), can fail over to a secondary region automatically and safely (without losing any committed transactions) in the vast majority of circumstances, and in the worst case you can (manually, because you are accepting data loss!) give up very recently committed transactions to fail over.
The satellite mode that I described is an active/passive mode. One region is accepting reads and writes; the other is just replicating everything. When it looks like the active region is in trouble, the asynchronous replication is "finished up" before switching over to the other region. The multiple datacenters in each region ensure that usually a regional failure will be "slow enough" that this automatic process (which after all only takes hundreds of milliseconds to seconds) can usually complete before a region goes away. And this will be handled pretty transparently by the datastore.
If a region is blown up instantly by an orbital laser cannon, then the database will go down and you will have to manually tell it to recover ACI in the other region, sacrificing the durability of whatever committed transactions in the lost region were destroyed by the laser cannon.
Well, if you want ACID then you are going to have to pay for at least one geographic round trip per committed transaction. (So why not go active/passive, and have at least one of your datacenters be fast?)
But what if you have different pieces of data and you want them to be fast in different datacenters? I think a great solution to this can be layered on top of multiple FoundationDB clusters, each using the satellite mode, but this is one thing that I at least haven't been able to think of a way to provide properly at the data model agnostic key/value store layer - the details about what to put where seem fundamentally dependent on your data model.
> So why not go active/passive, and have at least one of your datacenters be fast?
While local writes would stay fast, wouldn’t active/passive see higher-latency non-local writes than Spanner or Fauna’s (assuming a NAM-EUR-ASIA topology)?
I agree with and do appreciate the multiple FoundationDB clusters suggestion.
I'm speculating, but I think in this mode, from the "slow" datacenters you would see one round trip time to start a transaction, then reads will be fast (they can be done safely from your local datacenter because of MVCC), and then one round trip time to commit the transaction. I think that's as good as Spanner does with the same geography, but I'm not sure. I think you could get rid of the first round trip time even without any clock synchronization nonsense, by speculating on a read version for read/write transactions. And 1xRTT is obviously as fast as physically possible for ACID.
Update: Apparently Spanner is way slower than I thought in "slow" datacenters, doing a round trip for every transactional read. So this would absolutely stomp that. (Although spanner, as a higher level system, has the ability to move different data to be fast in different regions built in, which is nice, and as I said it will probably have to be a layer feature in the fdb world)
Wait how can it be ACID if it can tolerate N failures from N+1 nodes? Doesn't that kill consistency almost by definition? What level of isolation does it support? Snapshot? Serializable?
It is serializable and totally uncompromising. Philosophically pretty much everything defaults to the safest possible thing.
It can't tolerate N failures from N+1 nodes. It can tolerate N failures with N+1 copies of your data. In a big cluster you have plenty of nodes but storing everything 5 times to tolerate 2 failures is really expensive.
> It can't tolerate N failures from N+1 copies of your data
Sorry I got the terminology wrong, but that's a distinction without a difference. If it can tolerate N failures from N+1 copies, that means a network partition would allow any one copy to continue chugging along making changes by itself. You have two options: consistency is dropped and you downgrade to eventually consistent (at best), or availability is dropped meaning a single node can't make changes without a majority, which invalidates the N of N+1 failures claim. (Which is where the N failures of 2N+1 copies claim comes from in the first place: after N failures you still have a majority of copies.)
Or there's some other magic quorum protocol I've never heard of that makes the majority problem disappear.
FoundationDB stores 2N+1 copies of some "coordination state" and does a consensus algorithm whenever it is updated. But this state doesn't contain a copy of your data; basically think of it as storing a replication configuration. It's very small and rarely changes.
In the happy case, replication takes place using the replicas and quorum rules specified by this configuration. For example, you might require writes to succeed synchronously against all N+1 replicas of some transaction log. After N failures, there will still be 1 replica remaining with the latest transactions. But in order to proceed after any failures, you have to do a consensus transaction against a majority of replicas of the coordination state, to specify the new set of N+1 replicas you will be using. And you also make sure that the 1 replica you are recovering from knows you are doing it, so that it won't continue to accept writes under the old replication configuration.
There can't be two partitions capable of committing transactions, because (in this case) you need either
(a) All N+1 replicas of the log, so that you can commit synchronously, or
(b) A majority (N+1 out of 2N+1) of the replicas of the coordination state, AND 1 replica of the log
Sorry if this isn't a great explanation. Anyway it does work. I expect that you could rephrase this as an optimization of a consensus protocol, though I think it would be hard to build a performant and realistically featureful implementation that way.
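The no-split-brain argument above can be checked mechanically: enumerate every possible commit quorum and recovery quorum and verify that each pair overlaps. This is only a sketch of the quorum arithmetic (with N = 2 and made-up replica names), not how FoundationDB is actually implemented:

```python
from itertools import combinations

def quorums_always_intersect(quorums_a, quorums_b):
    """True if every quorum in the first set shares a member with
    every quorum in the second set."""
    return all(set(a) & set(b) for a in quorums_a for b in quorums_b)

N = 2
logs = [f"log{i}" for i in range(N + 1)]          # N+1 log replicas
coords = [f"coord{i}" for i in range(2 * N + 1)]  # 2N+1 coordinators

# (a) Committing requires ALL N+1 log replicas.
commit_quorums = [tuple(logs)]

# (b) Recovering requires a majority (N+1 of 2N+1) of the
# coordinators AND at least 1 log replica.
recovery_quorums = [c + (l,)
                    for c in combinations(coords, N + 1)
                    for l in logs]

# Every commit overlaps every recovery in at least one log replica,
# and any two recoveries overlap in a coordinator majority, so two
# disjoint partitions can never both make progress.
assert quorums_always_intersect(commit_quorums, recovery_quorums)
assert quorums_always_intersect(recovery_quorums, recovery_quorums)
```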
When I wrote that I was wondering if it used a second 2N+1 dataset just for coordination & consensus. This has the benefit of separating data from consensus, allowing the N of N+1 data failure tolerance. But at the end of the day consistency still comes down to an N of 2N+1 failure tolerance for that second coordination state. It's smaller and easier to replicate, etc., but it seems like it still has the same fault tolerance as just replicating the data 2N+1 times. It sounds like it's worked out great in practice for FDB.
But you say it rarely changes... wouldn't it have to change every time there's a change to the dataset? I feel like this means you have to do even more replication and consensus than just replicating the data without this second consensus state.
You only have to write to the coordination state when there is a failure. You can commit millions of transactions in the happy case without ever doing such a write. And failure detector performance and other engineering concerns are usually more of a limitation, in practice, on the performance of recovery than the latency of the coordination state consensus, even when the coordinators are geographically distributed.
So the strategy is to optimistically assume that there are no failures and just replicate to all N+1 copies. If there's a failure then back off to the consensus state to coordinate the fix rigorously.
In the best case with no failures this works great. But as the number of failures increases, I feel like due to the extra synchronization there will be an inflection point where the cost of the extra layers of coordination will be higher than just synchronizing the data directly. But due to 'other concerns' that inflection point is pushed back by a lot.
If you expect to have lots of (hopefully very temporary!) node failures, I think FoundationDB has another trick up its sleeve. You can store (say) N+2 replicas of transaction logs, which are also relatively small and (since sequential) efficient. Then you have a write quorum of N+1 and a recovery quorum of 2 logs, and you don't have to do coordination on every failure.
It's certainly true that with enough failures you aren't going to make much progress. I'm not sure that is any less true with plain old state machine replication protocols, though.
The coordination consensus is (our own implementation of) disk paxos, which we liked for its operational properties in our context (the coordinators don't need to know about each other or communicate directly). An early version of fdb had a dependency on Zookeeper for this purpose; you can use anything.
Maybe not all data is replicated across all zones, so zones provide consensus but replicas provide data. Basically you have witness with enough information to provide consistency but without the full data store.
Spanner is geographically distributed and synchronizes over Google’s fiber. I’m not sure where FoundationDB falls on the spectrum, but there is “distributed” meaning runs on multiple machines, vs multiple availability zones (which could be in the same rough geographic area to minimize latency), vs multiple geographic regions, which is one of Google’s design goals for its distributed infrastructure nowadays: to be able to withstand a natural disaster or other disruption that takes a whole region offline, without missing a beat.
There’s still a lot to be said for the property, “scales horizontally to N machines,” of course, even if those machines have to be in the same data center.
It's more like what CockroachDB or TiDB use underneath. They all are suited for local clusters, but cannot perform well enough replicating across multiple far away datacenters over public internet, latency trade offs would be unbearable. Spanner is a bit different, with Google's fancy clocks and fancy networks, it can get better latency trade offs that might satisfy more applications.
Citus I can't remember, but if Citus does it properly wrt consistency, it should be in the same category as CockroachDB and TiDB, there is no magic.
See CockroachDB’s Living Without Atomic Clocks [0]. TrueTime [1] is a proprietary Google technology. Considering TrueTime upper bound uncertainty (< 10ms) [2], it may depend on Google’s network quality and breadth.
CockroachDB assumes you use NTP on your servers — but if you actually put GPS clocks into your servers, you could adjust the safety interval it assumes for clock drift down, and it'd behave pretty much the same as Google's.
I wish Apple just became a customer instead of buying and burying this for 3 years. Glad to see it come back but community momentum is hard to rebuild.
Since it has become open source and the system's architecture is in the open, I wonder how this criticism of FoundationDB from a VoltDB architect https://www.voltdb.com/blog/2015/04/01/foundationdbs-lesson-... fares against the knowledge that has now become available. (To summarize, the author argues that building an SQL layer on top of an ordered key-value store is suboptimal.)
I'd never heard of FDB, but the concept is very intriguing. We know of many NoSQL databases built on top of a KV store like RocksDB or LevelDB (like Dgraph before moving to its new storage engine), but those still need a lot more work before they can be called distributed and scalable.
By using FoundationDB we can skip many of those parts and focus on other things like the query language and API. That's why I like the layer concept; unfortunately there is very little documentation about it. I found some layers written in Python in the repo, but I don't understand where a layer sits in the general architecture. I thought it would be like a plugin, but I think that's not the case.
A "layer" uses the fdb client much as it might use rocksdb. The layer can be a library embedded in your application, or a network service, it's up to you.
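To make "a layer is just client code" concrete, here is a toy queue layer simulated over a local dict. Everything here (the class name, the local counter standing in for versionstamped keys) is hypothetical illustration; a real layer would issue the same reads and writes through the fdb client, inside `@fdb.transactional` functions, against a shared cluster:

```python
import itertools

class QueueLayer:
    """A toy queue 'layer': nothing but client code over an ordered
    key/value map. In real FDB code each method would run inside a
    transaction, and ordering would come from versionstamped keys
    rather than a local counter."""

    def __init__(self):
        self.kv = {}                  # stands in for the database
        self.seq = itertools.count()  # stands in for versionstamps

    def push(self, item):
        # Append under a monotonically increasing key.
        self.kv[("queue", next(self.seq))] = item

    def pop(self):
        # Pop the smallest key in the queue subspace (FIFO order).
        keys = sorted(k for k in self.kv if k[0] == "queue")
        if not keys:
            return None
        return self.kv.pop(keys[0])
```

Because the layer owns its key encoding, many such layers (queues, indexes, document stores) can coexist in one cluster under different subspaces.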
Wow this is super exciting. I had a half dozen things that I thought FDB would be good for and then poof! it got sucked up into the Apple spaceship. Now to have it emerge unscathed is pretty awesome.
I wonder what the SSD engine performance would look like with NVMe standard NAND or an Optane SSD instead of SATA. Any FoundationDB guys/gals on this thread able to comment?
Another Q: what's more commonly used in current FoundationDB deployments: memory engine or storage engine?
Nice documentation around the Python API. At this point, the docs state it supports 2.7-3.4. What scale of work is required to catch it up to 3.6 compatibility? https://apple.github.io/foundationdb/api-python.html
It is strange that ArangoDB isn't mentioned often on HN, or in this thread. It is a multi-model (KV, document, graph) database (with transactions) and I have been happily using it for a while now.
Haven't scaled it yet to any large installations, so I can't speak to how well it does that.
Seems like a pretty comprehensive release in terms of OS and language support.
One oddity I see is that the ruby gem is not available on rubygems.org and therefore cannot be easily installed and maintained using the ruby package manager which is a bit of a pain.
Is foundationdb capable of providing a linearizable data store? As I remember from Martin Kleppmann's book, Serializable Snapshot Isolation is not linearizable because the snapshot does not include writes more recent than itself.
Yes. Linearizable means both serializable and externally consistent (or is sometimes used as just a synonym for the latter), and FDB has these properties with respect to transactions.
That would mean that SSI is not used to provide the serializable isolation level. If so, what is used instead? Two-phase locking? I thought that wasn't very scalable.
I guess textbook SSI is willing to "reorder" conflicting transactions if the result is still serializable, which could violate external consistency if you don't have any other bounds on the order. In the language of SSI, fdb simply aborts the later of any pair of read/write transactions with an rw-conflict, in accordance with a fixed ordering which is externally consistent.
I guess it could also be that your book uses an idiosyncratic definition of linearizable, like trying to apply it to individual operations within transactions, which might rule out any optimistic concurrency method. It might just be better to delete this word from your vocabulary in the database field because there is no wide agreement on what it means. The first two hits on Google for me are Wikipedia and Peter Bailis, and they give clearly conflicting definitions, though I think fdb satisfies both!
Thanks, I'd love to have a bit more of your attention, foundationdb seems very interesting but I need to know a bit more :)
Let me expand on the definition in Kleppmann's book then. I think it is important because it creates a difference between SSI and a typical serializable level based on 2PL.
The below is paraphrasing the definitions on p. 324-329.
The book references http://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf.
(I must admit, I read the book, not the paper).
Basic idea - make a system appear as if there were only one copy of the data and ALL operations on it are atomic.
In this model, there may be replicas, but we don't care about them.
As soon as a client completes a write to the db, all clients reading the db must be able to see the value just written.
In SSI this is not true, because the snapshot may not include writes more recent than the snapshot, so reads from the snapshot are not linearizable.
Linearizable CAS register is equivalent to consensus, and can provide total order. It is therefore what most developers would love to have (if cost was not an issue :) )
"A history is serializable if it is equivalent to one in which transactions appear to execute sequentially, i.e., without interleaving... A history is strictly serializable if the transactions’ order in the sequential history is compatible with their precedence order... Linearizability can be viewed as a special case of strict serializability where transactions are restricted to consist of a single operation applied to a single object."
In these terms, FoundationDB has the strict serializability property, and thus if you do exactly one operation in each FoundationDB transaction then that is linearizable.
But that kind of linearizability is much less powerful than what FoundationDB actually gives you. You cannot efficiently maintain global invariants, like indexes, with single-operational linearizability. I don't think this definition is very useful! I think strict serializability (which is to say serializability & external consistency) is what you actually want.
A linearizable CAS register can be implemented in FDB as simply as this:
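The snippet referenced here doesn't appear in this excerpt, but one plausible shape for it, against the Python bindings' `@fdb.transactional` pattern, is shown in the comment below; the runnable function underneath swaps in a plain dict for the transaction so the compare-and-set logic can stand alone:

```python
# A plausible sketch (not the original author's exact code). With the
# real Python bindings it would look roughly like:
#
#   @fdb.transactional
#   def cas(tr, key, expected, new):
#       if tr[key] == expected:
#           tr[key] = new
#           return True
#       return False
#
# The serializable transaction makes the read-compare-write atomic, and
# strict serializability makes the register linearizable. Below, a dict
# stands in for the transaction:

def cas(store, key, expected, new):
    """Compare-and-set: write `new` only if the current value equals
    `expected`; report whether the swap happened."""
    if store.get(key) == expected:
        store[key] = new
        return True
    return False
```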
Thank you very much for your in-depth explanation. I believe the only thing left for me is to run FDB myself, sounds very promising :) FDB replacing ZooKeeper + something else would reduce the complexity of the target distributed system, almost too good to be true.
postgresql is not strictly serializable/externally consistent
for example this will commit under serializable:
create table counters(counter int);
insert into counters(counter) values(1);
BEGIN TRANSACTION ISOLATION LEVEL serializable;
select sum(counter) from counters;
/* insert sum into counters. wait until committing next transaction before executing the insert */
insert into counters(counter) values(1);
COMMIT;
/* this transaction should commit before doing the insert in the above transaction and after the above transaction has calculated the sum */
BEGIN TRANSACTION ISOLATION LEVEL serializable;
insert into counters(counter) values(10);
COMMIT;
both transactions commit and the final table looks like:
1, 10, 1
which is possible if the first transaction committed first and then the second transaction committed. But it is possible for another client to see the table as [1], [1, 10], [1, 1, 10], which is a sequence of states that should not be possible: if you see [1], [1, 10], then you should see [1, 10, 11] as the last state. Hence it violates external consistency.
Well, benmmurphy isn't talking about serializability (some correct ordering exists), but strict serializability (roughly: at least one correct ordering corresponds to wall clock order). PG does have the former, but not the latter.
Huh? The example is for PostgreSQL's "serializable" isolation level, and that's what I'm confused about.
Serializability should ensure that the outcome is equivalent to some serial execution. What serial execution of those five transactions yields [1], [1, 10], [1, 1, 10]?
But also, I just read that prior to PostgreSQL 9.1 (released in 2011), the "serializable" isolation level was actually just snapshot isolation (now called "repeatable read"). So maybe that's what benmmurphy is referring to?
This is such great news. I saw the FoundationDB guys present at the New England Database Summit a couple of years ago. Their demo reminded me of Sun Microsystems' famous demo of unscrewing their hard disk.
Is the transaction authority in-memory or on disk?
This architecture seems kinda clunky. I wonder what the performance is like, and whether this is a good fit for quick k-v store use cases?
Nope, the actor compiler works with Mono, and you need Windows to compile the Windows client. When we were independently working on it, we just ditched the Windows code completely (we are a Mac/Linux shop).
Before FDB was acquired by Apple, a lot of the engineers used Visual Studio on Windows. It's a good IDE for C++. You certainly don't need it to develop though!
Windows was never the most important deployment platform for the product. I'm not sure how performant the server is. But it should work, and I would think the client is fine.
I think the Apache Ignite distributed KV store is a much better choice, as it has 2PC, indexes, distributed computing idioms, and can be embedded as a library in a JVM app. Plus it supports a SQL engine, thanks to the H2 SQL parser.
You can also create graph layer over it using gremlin in a day or two.
This seems nice. But besides a bunch of fanboy comments coming from the creators and devs, why is this exciting to the rest of us, when things like joining tables in an RDBMS are trivial?
FDB is a lower layer. It makes it possible to build a distributed RDBMS on top of it. There is some competition in this area, but nothing remotely as mature as FDB.
Didn't downvote you, but I believe the excitement is over FoundationDB's ability to perform ACID-compliant distributed transactions without sacrificing performance, which to my knowledge no current RDBMS or even NoSQL system can do.
There are several distributed storage systems like Ceph and they all have problems. Ceph is not good because it's an object storage system trying to provide block storage and a filesystem on top, which will never work well.
Absolutely right - I wrote an object storage backed filesystem for Hadoop and synchronizing was a complete nightmare.
It certainly worked well within our required architecture - but as a general purpose system it would have several issues over typical network topologies.
As active volume size increases and/or changes, the object storage layer latency can bump up to minutes, or even hours. We stopped testing at 100TB volumes, as the object storage layer was backed up by hours.
Object Storage is very convenient, but the lack of good metadata and latencies involved basically means that it's only good as an archival backing store of active data. If your active datasource goes out - you could lose hours of data unless you have local copies.
Not that this really has anything to do with FoundationDB, but why do you say that object storage is a poor substrate for file and block abstractions? There are many high-performance block and file systems built on object storage.
Block level is the lowest form of addressing bytes on devices. Filesystems are an abstraction on top of block devices. Object stores are an abstraction on filesystems.
Emulating a low-level layer on a higher-level abstraction (which itself is using this hierarchy) will never match the speed, scale, or reliability of doing it correctly.
Yes, you can architect a storage system this way. But
1) Even if you do, many, many, many high-performance systems are "on top" of a filesystem but don't actually use the filesystem for anything except perhaps as a block allocator. Consider databases.
2) Many, many object stores do not abstract on top of filesystems. Modern RADOS, the distributed object store, stores its local data in a local object store called BlueStore. BlueStore speaks directly to the block device; there's no filesystem involved.
3) Even if you did store part of your distributed object store data on top of a local filesystem, that's not necessarily an issue. HDFS does this. (HDFS, despite the "FS" in its name, is an object store as most practitioners understand them.)
1) Yes, databases are just object stores with indexing and querying, which are an abstraction on filesystems, which are abstractions on block devices.
2) Rados is an object store, which is an abstraction on BlueStore (effectively a filesystem and replacement of FileStore), which is an abstraction on block devices.
3) HDFS is an object store, which is an abstraction on filesystems, which are an abstraction on block devices.
I'm not sure what your point is, because you just restated what I already said. They are abstractions, and they work just fine without any performance issues, because that is the trade-off of having an abstraction.
What I also said is that emulating low-level layers on a higher-level interface (like a block device on top of a database or object store) will never match the original block device. What is untrue about this?
It appears that you want to make a very simple statement: "Abstractions tend to introduce overhead. If any software layers, an OS, or a network are added on top of a block device, I/O overhead will be introduced somewhere." As a conversational seed, I would wager most people would agree with that statement, in general.
One issue in this thread is that abstractions are concepts, not cpu instructions. In order to discuss overhead, one needs to reify the abstraction. For example, if you care about latency overhead, the block scheduler will definitely introduce overhead. But if you care about throughput, you probably /want/ abstractions like queues and schedulers.
> "What I also said is that emulating low-level layers on a higher-level interface (like a block device on top of a database or object store) will never match the original block device. What is untrue about this?"
Nothing is untrue about the sentiment of your statement. But from a practical standpoint, storage devices are useless pieces of junk without software. So to say abstractions slow down storage device while ignoring their utility feels arbitrary: why not talk about the length of the SATA cable, or the firmware in the disk controller?
If the answer is that you just wanted to make the simple statement like the one I quoted at the start of this post then that's great, I think we are all in agreement. Otherwise, it's not clear what your point is and many of the supporting examples that you list are stated as fact, but are in reality either generally untrue, or very nuanced points, both of which tend to attract strong opinions :)
> Block level is the lowest form of addressing bytes on devices. Filesystems are an abstraction on top of block devices. Object stores are an abstraction on filesystems.
I don't agree with this, but I think you may be confused because "Object Storage" can mean several different things.
"Object Store" in Ceph (as in RADOS - Reliable Autonomous Distributed Object Store) basically means key-value store. I typically say "blob store" instead to avoid the confusion with more sophisticated systems. It is exposed through an S3-like API. As far as I know, this layer of Ceph is pretty good, and you need a layer like this in most distributed systems anyway.
Ceph provides something called RBD, RADOS Block Device, which exposes a Block Device interface and is implemented on top of RADOS blob storage. It is useful for VM disks and has decent performance because it makes heavy use of the cache.
Some people use filesystems on top of RBD, but as far as I know CephFS itself does not sit on top of RBD. It is not as widely used as RBD because it is pretty recent (first release in 2016). The data is stored in RADOS and the metadata (which is the hardest part in a distributed filesystem) is dealt with by a Metadata Server cluster (MDS). This sounds like a typical distributed filesystem architecture to me, similar to GFS (the MDS replaces the GFS master and RADOS is used instead of chunk servers).
People tend to have a lot of issues with Ceph, but I think this is because:
1) It is used in reasonably large-scale production settings where you are going to have issues anyway;
2) It is not as easy to understand and fine-tune as it should be;
3) Some people expect it to solve all their issues magically with perfect performance;
4) Some people use filesystems on top of RBD when they should have used CephFS, or even direct interfaces to RADOS when possible.
But in general, I think Ceph is an example of a decently architectured complex distributed system.
Just for future reference, RADOS is actually not very S3-like. It is an object store; it maps object names to buckets of bytes. But unlike S3 and many similar object stores or key-value DBs, RADOS allows you to do file-like operations: you can append, write to random offsets in the object, overwrite pieces of it without rewriting the whole object, etc. (That's all in addition to some stunningly-complex stuff like injecting custom code to do specific kinds of transactional read-writes on the OSD [storage node] itself.)
That's all key to RBD being useful, or indeed CephFS itself. There are systems that map a filesystem layer on top of S3, but they have trouble because there aren't good ways to overwrite random small pieces of an S3 object. With RADOS, there are! :)
Sure, and key/value systems are at a similar level to object stores, meaning they are abstractions on filesystems (which are abstractions on block devices). This is the hierarchy.
Using Ceph for block and file access is like using AWS S3 to emulate block devices and filesystems. It'll work, and there is software for it, but it will never be very good. And Ceph is far from S3.
As a general rule of thumb, I would definitely agree that increasing the number of abstraction layers should be done with consideration to performance concerns.
Interestingly, in the latest version of Ceph the abstractions are a bit different than you listed. Ceph is now using an object store built directly on top of raw devices. It's the file system, block, and object abstractions that exist on top of that.
I mostly agree, but if you want a distributed, reliable, scalable block store it is pretty easy to build one on FDB. Here is a super simple one that acts as an NBD (network block device) server that you can mount from Linux:
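For a flavor of how little is involved, here is a toy sketch (my own construction: a plain dict stands in for the FDB keyspace, and there is no networking, durability, or transaction logic): a block device is just a mapping from block number to a fixed-size value.

```python
BLOCK_SIZE = 4096

class KVBlockDevice:
    """Toy block device over a key-value store: key = block number."""
    def __init__(self, kv=None):
        self.kv = kv if kv is not None else {}  # stand-in for an FDB keyspace

    def read_block(self, block_no):
        # Unwritten blocks read as zeros, like a fresh disk.
        return self.kv.get(block_no, b"\x00" * BLOCK_SIZE)

    def write_block(self, block_no, data):
        assert len(data) == BLOCK_SIZE
        self.kv[block_no] = data

    def write(self, offset, data):
        # Read-modify-write each block the byte range touches; with real
        # FDB this whole loop would run inside a single transaction, which
        # is what makes the emulated device consistent under concurrency.
        while data:
            block_no, start = divmod(offset, BLOCK_SIZE)
            chunk = data[:BLOCK_SIZE - start]
            block = bytearray(self.read_block(block_no))
            block[start:start + len(chunk)] = chunk
            self.write_block(block_no, bytes(block))
            offset += len(chunk)
            data = data[len(chunk):]

dev = KVBlockDevice()
dev.write(4090, b"spans two blocks")  # a write straddling a block boundary
assert dev.read_block(0)[4090:] == b"spans "
assert dev.read_block(1)[:10] == b"two blocks"
```

An NBD server is then mostly plumbing: translate the protocol's read/write requests into these two operations.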
To the best of my knowledge (n.b.: I am not an expert, though I follow this field) there still aren't any direct competitors to the breadth of what FoundationDB can do and do well.
The speed and durability have been unmatched. We can write millions of transactions per second, with millions of reads, without issue. We have never lost data or found data to be inconsistent.
Spanner can do global consistency and (some?) transactions, but I'm unaware of it being able to do the sort of layering Foundation can do internally to expose largely different database forms on top of it. I have never used Spanner, though, so I'm open to being corrected.
What do you mean by “forms”? Spanner is also layered. The bottom layer is basically a key-value store. On top of that there’s a full blown SQL layer, which, BTW can work with hierarchical records as well as flat tables. Both support transactions and guarantee global consistency.
Apple builds their OS with a lot of software from FreeBSD, but when they opensource FoundationDB they don't provide a distribution that will work on it. I know the license says Apple doesn't have to do anything, but it just seems wrong that they didn't provide a download.
Based on your earlier comment I guess you think that Apple owes the FreeBSD community a version of FoundationDB that will work with it, because Apple uses FreeBSD technology.
I make no comment on the validity of the stance but I think you are probably in the minority.
However, since it is now open source, it is at least possible for someone to do the work to get it running on FreeBSD.
[1] Yes, yes, we ran Jepsen on it ourselves and found no problems. In fact, our everyday testing was way more brutal than Jepsen, I gave a talk about it here: https://www.youtube.com/watch?v=4fFDFbi3toc