This is INCREDIBLE news! FoundationDB is the greatest piece of software I’ve ever worked on or used, and an amazing primitive for anybody who’s building distributed systems.
The short version is that FDB is a massively scalable and fast transactional distributed database with some of the best testing and fault-tolerance on earth[1]. It’s in widespread production use at Apple and several other major companies.
But the really interesting part is that it provides an extremely efficient and low-level interface for any other system that needs to scalably store consistent state. At FoundationDB (the company) our initial push was to use this to write multiple different database frontends with different data models and query languages (a SQL database, a document database, etc.) which all stored their data in the same underlying system. A customer could then pick whichever one they wanted, or even pick a bunch of them and only have to worry about operating one distributed stateful thing.
But if anything, that’s too modest a vision! It’s trivial to implement the Zookeeper API on top of FoundationDB, so there’s another thing you don’t have to run. How about metadata storage for a distributed filesystem? Perfect use case. How about distributed task queues? Bring it on. How about replacing your Lucene/ElasticSearch index with something that actually scales and works? Great idea!
And this is why this move is actually genius for Apple too. There are a hundred such layers that could be written, SHOULD be written. But Apple is a focused company, and there’s no reason they should write them all themselves. Each one that the community produces, however, will help Apple to further leverage their investment in FoundationDB. It’s really smart.
I could talk about this system for ages, and am happy to answer questions in this thread. But for now, HUGE congratulations to the FDB team at Apple and HUGE thanks to the executives and other stakeholders who made this happen.
Now I’m going to go think about what layers I want to build…
[1] Yes, yes, we ran Jepsen on it ourselves and found no problems. In fact, our everyday testing was way more brutal than Jepsen, I gave a talk about it here: https://www.youtube.com/watch?v=4fFDFbi3toc
Will said what I wanted to say, but: me too. I'm super happy about this and grateful to the team that made it happen!
(I was one of the co-founders of FoundationDB-the-company and was the architect of the product for a long time. Now that it's open source, I can rejoin the community!)
Another (non-technical) founder here, and I echo everything voidmain just said. We built a product that is unmatched in so many important ways, and it's fantastic that it's available to the world again. It will be exciting to watch a community grow around it; this is a product that can benefit hugely from open-source contributions in the form of layers that sit on top of the core KV store.
I echo what wwilson has said. I work at Snowflake Computing, a SQL analytics database in the cloud, and we have been using FoundationDB as our metadata store for over 4 years. It is a truly awesome product and has proven rock-solid over this time. It is a core piece of our architecture and is heavily used by all our services. Some of the layers that wwilson is talking about, we've built: a metadata store, an object-mapping layer, a lock manager, and a notifications system. In conjunction with these layers, FoundationDB has allowed us to build features that are unique to Snowflake. Check out our blog post titled "How FoundationDB powers Snowflake Metadata forward" [1].
Kudos to the FoundationDB team and Apple for open sourcing this wonderful product. We're cheering you all along! And we look forward to contributing to the open source and community.
I am one of the designers of probably the best known metadata storage engine for a distributed filesystem, hopsfs - www.hops.io.
When I looked at FoundationDB before Apple bought you, you supported transactions - great. But we need much more to scale. Can you tell me which of the following you have:
row-level locks
partition-pruned index scans
non-serialized cross-partition transactions (that is, a transaction coordinator per DB node)
distribution-aware transactions (hints on which TC to start a transaction on)
It's somewhat hard to answer your questions because the architecture (and hence, terminology) of FoundationDB is a little different than I think you are used to. But I will give it a shot.
FoundationDB uses optimistic concurrency, so "conflict ranges" rather than "locks". Each range is a (lexicographic) interval of one or more keys read or written by a transaction. The minimum granularity is a single key.
FoundationDB doesn't have a feature for indexing per se at all. Instead indexes are represented directly in the key/value store and kept consistent with the data through transactions. The scalability of this approach is great, because index queries never have to be broadcast to all nodes, they just go to where the relevant part of the index is stored.
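The pattern is simple enough to sketch in a few lines of Python. Note this is a toy model, not the real FDB API: an ordinary dict stands in for the ordered key/value store, and the key layout is purely illustrative.

```python
# Toy ordered key/value store standing in for FoundationDB. Tuples sort
# lexicographically, much like FDB's tuple-layer-encoded keys.
store = {}

def set_user(user_id, name):
    """Update the record and its secondary index entry together, as one
    transaction would in FDB."""
    old = store.get(("user", user_id))
    if old is not None:
        del store[("idx_name", old, user_id)]   # drop the stale index entry
    store[("user", user_id)] = name             # primary record
    store[("idx_name", name, user_id)] = b""    # index entry: name -> id

def users_named(name):
    """Index lookup: equivalent to a range scan over ("idx_name", name, *),
    which only touches the nodes storing that part of the index."""
    return sorted(k[2] for k in store
                  if k[0] == "idx_name" and k[1] == name)

set_user(1, "alice")
set_user(2, "bob")
set_user(3, "alice")
set_user(1, "carol")   # rename: record and index entry change atomically
```

Because the record and its index entry are written in the same transaction, a scan of the index prefix is always consistent with the data, and no query ever needs to be broadcast to all nodes.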
FoundationDB delivers serializable isolation and external consistency for all transactions. There's nothing particularly special about transactions that are "cross-partition"; because of our approach to indexing and data model design generally we expect the vast majority of transactions to be in that category. So rather than make single-partition transactions fast and everything else very slow, we focused on making the general case as performant as possible.
Transaction coordination is pretty different in FoundationDB than in 2PC-based systems. The job of determining which conflict ranges intersect is done by a set of internal microservices called "resolvers", which partition up the keyspace totally independently of the way it is partitioned for data storage.
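A toy model of that scheme, with plain Python objects in place of the resolver microservices, single keys in place of conflict ranges, and a global counter in place of the real versioning machinery:

```python
# Each resolver owns a slice of the keyspace (independent of data
# placement) and remembers the commit version of recent writes in its
# slice. A transaction commits only if every resolver sees no write to
# the transaction's read set newer than its read version.
class Resolver:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.last_write = {}                  # key -> commit version

    def owns(self, key):
        return self.lo <= key < self.hi

    def check(self, read_version, read_keys):
        return all(self.last_write.get(k, 0) <= read_version
                   for k in read_keys if self.owns(k))

    def apply(self, version, write_keys):
        for k in write_keys:
            if self.owns(k):
                self.last_write[k] = version

resolvers = [Resolver("", "m"), Resolver("m", "\xff")]  # keyspace split at "m"
version = 0

def try_commit(read_version, read_keys, write_keys):
    global version
    if all(r.check(read_version, read_keys) for r in resolvers):
        version += 1
        for r in resolvers:
            r.apply(version, write_keys)
        return True
    return False    # a conflict anywhere aborts the whole transaction
```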
Please tell me if that leaves questions unresolved for you!
Thanks for the detailed answer.
Is it actually serializable isolation - does it handle write skew anomalies (https://en.wikipedia.org/wiki/Snapshot_isolation)?
Most OCC systems I know have only snapshot isolation.
Systems that sound closest to FoundationDB's transaction model that I can think of are Omid (https://omid.incubator.apache.org/) and Phoenix (https://phoenix.apache.org/transactions.html). They both support MVCC transactions, but I think they have a single coordinator that gives out timestamps for transactions, like your "resolvers". The question is how your "resolvers" reach agreement: are they each responsible for a range (partition)? If transactions cross ranges, how do they reach agreement?
We have talked to many DB designers about including their DBs in HopsFS, but mostly it falls down on something or other. In our case, metadata is stored fully denormalized - all inodes in a FS path are separate rows in a table. In your case, you would fall down on secondary indexes - which are a must. Clustered PK indexes are not enough. For HopsFS/HDFS, there are so many ways in which inodes/blocks/replicas are accessed using different protocols (not just reading/writing files or listing directories, but also listing all blocks for a datanode when handling a block report). Having said that, it's a great DB for other use cases, and it's great that it's open-source.
Yes, it's really serializable isolation. The real kind, not the "doesn't exhibit any of the named anomalies in the ANSI spec" kind. We can selectively relax isolation (to snapshot) on a per-read basis (by just not creating a conflict range for that read).
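The classic write-skew case (two on-call doctors each read that the other is still on call, then each signs off) shows why validating reads matters. Here is a toy OCC commit check in Python, an illustration of the principle rather than FoundationDB's actual code:

```python
# A transaction conflicts if any key it READ was written by a transaction
# that committed after its read version. Validating reads, not just
# write-write overlap, is what rules out write skew.
committed = []   # list of (commit_version, written_keys)

def commit(read_version, reads, writes):
    for v, written in committed:
        if v > read_version and written & reads:
            return False                        # conflict: abort
    committed.append((len(committed) + 1, writes))
    return True

# Both transactions read {"alice", "bob"} at version 0, see two doctors
# on call, and each tries to take one off call.
t1 = commit(0, {"alice", "bob"}, {"alice"})     # commits
t2 = commit(0, {"alice", "bob"}, {"bob"})       # aborts: t1's write to
                                                # "alice" is in t2's reads
```

Under plain snapshot isolation both would commit, leaving nobody on call.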
I tried to explain distributed resolution elsewhere in the thread.
I believe our approach to indices pretty much totally dominates per-partition indexing. You can easily maintain the secondary indexes you describe; I don't understand your objection.
My guess is the objection lies in "Have to manage the index myself."
Also, the main drawback of "indices as data" in NoSQL is when you need to add a new index: suddenly, you have to scrub all your data and add it to the new index using some manual walk-the-data function, and you have to make sure that all operations that take place while you're doing this are also aware of the new index and its possibly incomplete state.
Certainly not impossible to do, but it sometimes feels a little bit like "I wanted a house, but I got a pile of drywall and 2x4 framing studs."
"I wanted a house, but I got a pile of drywall and 2x4 framing studs."
This is a totally legitimate complaint about FoundationDB, which is designed specifically to be, well, a foundation rather than a house. If you try to live in just a foundation you are going to find it modestly inconvenient. (But try building a house on top of another house and you will really regret it!)
The solution is of course to use a higher level database or databases suitable to your needs which are built on FDB, and drop down to the key value level only for your stickiest problems.
Unfortunately Apple didn't release any such to the public so far. So I hope the community is up to the job of building a few :-)
I agree with voidmain’s comment as secondary indexes shouldn’t be any different than the primary KV in your case. Almost seems that you’re focusing on a SQL/Relational database architecture but storing your data demoralized anyways. Odd combination of thoughts.
Is that where the data sits around waiting, eager, to be queried into action while watching data around them being used over and over again.... But that time never comes, thus leaving them to question their very worth?
> Transaction coordination is pretty different in FoundationDB than in 2PC-based systems. The job of determining which conflict ranges intersect is done by a set of internal microservices called "resolvers", which partition up the keyspace totally independently of the way it is partitioned for data storage.
Ok, per my other question that makes sense. Similar to FaunaDB except the "resolvers" (transaction processors) are themselves partitioned within a "keyspace" (logical database) in FaunaDB for high availability and throughput. But FaunaDB transactions are also single-phase and we get huge performance benefits from it.
> best known metadata storage engine for a distributed filesystem, hopsfs
Not even close. I don't even see anything I'd call a filesystem mentioned on your web page. I missed FAST this year, and apparently you had a paper about using Hops as a building block for a non-POSIX filesystem - i.e. not a filesystem in my and many others' opinion - but it's not clear whether it has ever even been used in production anywhere let alone become "best known" in that or any other domain. I know you're proud, perhaps you have reason to be, but please.
I'm not convinced a paper with a website and some proofs of concept would be considered the "best". You're throwing a bunch of components into a distro and calling yourselves everything from "deep learning" to a file system. It's not clear what you guys are even trying to do here.
You don't need to worry about shards / partitions with FDB. Their transactions can involve any number of rows across any number of shards. It is by far the best foundation for your bespoke database.
Your talk was one of the best talks I've seen, and I keep mentioning it to people whenever they ask me about distributed systems, databases, and testing.
I'm incredibly impatient to have a look at what the community is going to build on top of this very promising technology.
I'm not familiar with FDB, but what you say sounds almost too good to be true. Can I use it to implement the Google Datastore API? I've been trying for years to find a suitable backend so that I can leave Google land. Everything I tried either required a schema or lacked transactions or key namespaces.
If I recall correctly, the "SQL layer" you had in FDB before the Apple acquisition was a nice proof of concept, but lacked support for many features (joins in SELECT, for example). Is the SQL layer code from that time available anywhere to the public? (I'm not seeing it in the repo announced by OP.)
I used to work there. The SQL layer was actually capable of the majority of SQL features, including joins, etc. We had an internal Rails app that we used to dog-food it for performance monitoring. I used to work on the document layer, and was sad to see it wasn't included here.
I hope this question doesn't feel too dumb, but is it possible to implement a SQL layer using SQLite's virtual table mechanism and leverage all of FoundationDB's features?
It doesn't look like Apple open-sourced the Document Layer, which is a slight bummer. But I echo what Dave said below: what we got is incredible, let's not get greedy!
Also TBH now that I don't have commercial reasons to push interop, if I write another document database on top of FDB, I doubt I'd make it Mongo compatible. That API is gnarly.
Other than that, they totally did a "fake it until you make it": MongoDB 3.4 passed Jepsen a year ago, and MongoDB BI 2.0 contains their own SQL engine instead of wrapping PostgreSQL.
What specifically are you trying to avoid endorsing about the author of the LinkedIn post to which you linked? I couldn't find anything from a cursory web search.
He runs lambdaconf, and refused to disinvite a speaker who many people felt shouldn't be permitted to speak because of his historical non-technical writings.
(I've tried to keep the above as dry as possible to avoid dragging the arguments around this situation into this thread - and I suspect the phrasing of the previous comment was also intended to try and avoid that, so let's see if we can keep it that way, please)
You could try adding controversy to the author name and searching then. As mst correctly notes, I am trying to avoid reigniting said controversy while indicating my distaste.
I've forked the official SDK so that I can get extra functionality, but it's quite hard to keep it updated when internal stuff changes. There is no way I can contribute.
I can't use it everywhere I want. In short, it's not open source, and that sucks.
MongoDB is AGPL or proprietary. Many companies have a policy against using AGPL licensed code. So, if you work at one of those companies, then open source MongoDB is not an option (at least for work projects), and proprietary may not be either (depending on budget etc).
FoundationDB is now licensed under Apache 2, which is a much more permissive license, so most companies' open source policies allow it.
Unless people want to change the MongoDB code that they would be using, using the AGPL software should be a non-issue, and there are no problems with it. People should start understanding the available licenses instead of spreading fear.
I know that multiple other companies have a similar policy (either a complete ban on using AGPL-licensed software, or special approval required to use it), although unlike Google, they don't post their internal policy publicly.
If someone works at one of these companies, what do you want to do – spend your day trying to argue to get the AGPL ban changed, or a special exception for your project; or do you just go with the non-AGPL alternative and get on with coding?
The main reason it's a problem at many of the companies which ban it is they have a lot of engineers who readily patch and combine code from disparate sources and might not always apply the right discipline to keep AGPL things appropriately separate. Bright-line rules can be quite useful.
It is true that MongoDB's AGPL contagion and compliance burden, if you don't modify it, is less than many fear. It is also true that those corporate concerns are valid. MongoDB does sell commercial licenses so that such companies can handle their unavoidable MongoDB needs, but they would tend to minimize that usage.
> How about replacing your Lucene/ElasticSearch index with something that actually scales and works?
Do you have something to back that up? This to me reads like you imply that Elasticsearch does not work and scale.
It's definitely interesting, but I'm cautious. The track record for FoundationDB and Apple has not been great here. IIRC they acquired the company and took the software offline, leaving people out in the rain?
Could this be like it happened with Cassandra at Facebook where they dropped the code and then more or less stopped contributing?
Also, I haven't seen many contributions from Apple to open-source projects like Hadoop etc. in the past few years. Looking for "@apple.com" email addresses in the mailing lists doesn't yield many results. I understand that this is a different team and that people might use different email addresses and so on.
In general I'm also always cautious (but open-minded) when there's lots of enthusiasm and there seems to be no downside. I'm sure FoundationDB has its dirty little secrets and it would be great to know what those are.
They are, slowly. Swift is open source, Clang is open source. They are moving parts of the Xcode IDE into open source, like sourcekitd and now recently clangd.
I don't think they will ever move 'secret sauce' into open source, but infrastructural things like DBs and dev tooling seems to be going in that direction.
~~How is it different from when Apple acquired the then-open-source FoundationDB (and shut down public access)? They could have just kept it open source back then.~~
EDIT: My bad, looks like FoundationDB wasn't fully open-source back then.
From what I recall (and based on some quick retro-googling), I don't believe FoundationDB was open source. One of the complaints about it on HN back in the day was that it was closed...
Unrelated to the original topic, but I had never come across that talk and it is great. I use the same basic approach to testing distributed systems (simulating all non-deterministic I/O operations) and that talk is a very good introduction to the principle.
Know of any ideas around using this for long-term time series data? I wonder about something like OpenTSDB, but with this as the backend instead of HBase (which can be a sort of operational hell).
It's more like you would build a better Elasticsearch using Lucene to do the indexing and FoundationDB to do the storage. FoundationDB will make it fault tolerant and scalable; the other pieces will be stateless.
It'd take a low number of hours to wire up FoundationDB as a Lucene filesystem (Directory) implementation. Shared filesystem with a local RAM cache has been practical for a while in Lucene, and was briefly supported then deprecated in Elasticsearch. I've used Lucene on top of HDFS and S3 quite nicely.
If you have a reason to use FoundationDB over HDFS, NFS, S3, etc, then this will work well.
Doing a Lucene+DB implementation where each term's posting lists are stored natively in the key-value system was explored for Lucene+Cassandra as Solandra (https://github.com/tjake/Solandra). It was horrifically slow, not because Cassandra was slow, but because posting lists are heavily optimized, and putting them in a generalized B-tree or LSM-tree variant removes some locality and many of the possible optimizations.
I'm still holding out some hope for a hybrid implementation where posting list ranges are stored in a kv store.
I think you are on the right track. Storing every individual (term, document, ...) in the key value store will not be efficient, but you should be able to take Lucene's nice fast immutable data structure and stuff blocks of it (at the term level or below) into FDB values very efficiently. And of course you can do caching (and represent invalidation data structures in FDB), and...
FDB leaves room for a lot of creativity in optimizing higher layers. Transactions mean that you can use data structures with global invariants.
So from the Lucene perspective, the idea of a filesystem is pretty baked into the format. However, there's also the idea of a Codec which takes the logical data structures and translates to/from the filesystem. If you made a Codec that ignored the filesystem and just interacted with FDB, then that could work.
You can already tune segment sizes (a segment is a self-contained index over a subset of documents). I'd assume that the right thing to do for a first attempt is to use a Codec to write each term's entire posting list for that one segment to a single FDB key (doing similar things for the many auxiliary data structures). If it gets too big, then you should have tuned max segment size to be smaller. Do some sort of caching on the hot spots.
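A rough sketch of that packing in Python, with a dict standing in for FDB and a naive delta encoding in place of Lucene's real posting format:

```python
import struct

kv = {}   # stands in for the FDB key/value store

def write_segment(segment_id, postings):
    """postings: term -> sorted doc ids. One key per (segment, term),
    with the whole posting list packed into a single value."""
    for term, doc_ids in postings.items():
        deltas, prev = [], 0
        for d in doc_ids:                   # delta-encode to stay compact
            deltas.append(d - prev)
            prev = d
        kv[("seg", segment_id, term)] = struct.pack(f"{len(deltas)}I", *deltas)

def read_postings(segment_id, term):
    raw = kv.get(("seg", segment_id, term), b"")
    deltas = struct.unpack(f"{len(raw) // 4}I", raw)
    out, acc = [], 0
    for d in deltas:                        # undo the delta encoding
        acc += d
        out.append(acc)
    return out
```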
If anyone has any serious interest in trying this, my email is in my profile to discuss further.
Hmmm. I'm skeptical. A Lucene term lookup is stupidly fast. It traverses an FST, which is small and probably in memory. Traversing the postings lists itself also needs to be smart by following a skip table, which is critical for performance.
> you should be able to take Lucene's nice fast immutable data structure and stuff blocks of it (at the term level or below) into FDB values very efficiently.
That sounds a lot like Datomic's "Storage Resource" approach, too! Would Datomic-on-FDB make sense, or is there a duplication of effort there?
Datomic’s single-writer system requires conditional put (CAS) for index and (transaction) log (trees) roots pointers (mutable writes), and eventual consistency for all other writes (immutable writes) [0].
I would go as far as saying a FoundationDB-specific Datomic may be able to drop its single-writer system due to FoundationDB’s external consistency and causality guarantees [1], drop its 64-bit integer-based keys to take advantage of FoundationDB range reads [2], drop its memcached layer due to FoundationDB’s distributed caching [3], use FoundationDB watches for transactor messaging and the tx-report-queue function [4], use FoundationDB snapshot reads [5] for its immutable index tree nodes, and maybe more?
Datomic is a FoundationDB layer. It just doesn’t know yet.
I wrote the original version of Solandra (which is/was Solr on Cassandra) on top of Jake's Lucene on Cassandra[1].
I can confirm it wasn't fast!
(And to be fair that wasn't the point - back then there were no distributed versions of Solr available so the idea of this was to solve the reliability/failover issue).
I wouldn't use it on a production system nowadays.
I just watched the demo of 5 machines and 2 getting unplugged. The remaining 3 can form a quorum. What happens if it was 3 and 3? Do they both form quorums?
A subset of the processes in a FoundationDB cluster have the job of maintaining coordination state (via disk Paxos).
In any partition situation, if one of the partitions contains a majority of the coordinators then it will stay live, while minority partitions become unavailable.
Nitpick: To be fully live, a partition needs a majority of the coordinators and at least one replica of each piece of data (if you don't have any replicas of something unimportant, you might be able to get some work done, but if you have coordinators and nothing else in a partition you aren't going to be making progress)
The majority is always N/2 + 1, where N is the number of members. A 6-member cluster is less fault-tolerant than a 5-member cluster (quorum is 4 nodes instead of 3, and it still only tolerates 2 node failures).
The number of coordinators is separate from the number of boxes. You don't have to have a coordinator on every box.
I think you can set the number of coordinators to be even, but you never should - the fault tolerance will be strictly better if you decrease it by one.
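The arithmetic behind that, as a quick sketch:

```python
# Majority quorum is floor(N/2) + 1. An even-sized coordinator set
# tolerates no more failures than the odd set one smaller, so dropping
# from 6 to 5 loses nothing and shrinks the quorum.
def quorum(n):
    return n // 2 + 1

def tolerated_failures(n):
    return n - quorum(n)
```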
Apple also uses HBase for Siri I believe, what are some of the cluster sizes that FoundationDB scales to? Could it be used to replace HBase or Hadoop?
I was in attendance at your talk, and thought it was one of the best at the conference. Apple, I think, broke some hearts by going completely closed-source for a while, but I'm glad to see them open-sourcing a promising technology.
If scale is a function of read/writes, very large. In fact with relatively minimal (virtual) hardware it's not insane to see a cluster doing around 1M writes/second.
I was talking more about large file storage like HDFS, and the MapReduce model of bringing computation to data. HBase does the latter, and it's strongly consistent like FoundationDB, though FoundationDB provides better guarantees. As a K/V I understand what you and OP say.
How does this compare with CockroachDB? I'm planning to use CockroachDB for a project but would love to get an idea if I can get better results with FoundationDB.
They might be targeting the wrong market, hence the desperate marketing. For people who use MySQL/PostgreSQL a compatible, slower, but distributed database probably just doesn't solve any problem. Those people need a managed solution, not a distributed one.
That presentation was really good! The simulations were well explained. If one wanted to get in on this and build something with FoundationDB, but has no database experience (I do know many programming languages), where would one start? If anyone could point me in the right direction, I'd greatly appreciate it.
How scalable and feasible would it be to implement a SQL layer on top of SQLite's virtual table mechanism (https://www.sqlite.org/vtab.html) that redirects reads and writes of the record data to FoundationDB?
Long before we acquired Akiban, I prototyped a SQL layer using the (now defunct) sqlite4, which used a K/V store abstraction as its storage layer. I would guess that a virtual table implementation would be similar: easy to get working, and it would work, but the performance is never going to be amazing.
To get great performance for SQL on FoundationDB you really want an asynchronous execution engine that can take full advantage of the ability to hide latency by submitting multiple queries to the K/V store in parallel. For example if you are doing a nested loop join of tables A and B you will be reading rows of B randomly based on the foreign keys in A, but you want to be requesting hundreds or thousands of them simultaneously, not one by one.
Even our SQL Layer (derived from Akiban) only got this mostly right - its execution engine was not originally designed to be asynchronous and we modified it to do pipelining which can get a good amount of parallelism but still leaves something on the table especially in small fast queries.
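The latency-hiding point can be illustrated with a toy asyncio sketch; the simulated KV lookup and the table shapes are made up for illustration, not FDB's API:

```python
import asyncio

B_TABLE = {fk: f"b-row-{fk}" for fk in range(100)}   # inner join table

async def kv_get(key):
    await asyncio.sleep(0.01)        # pretend this is a network round trip
    return B_TABLE.get(key)

async def join(a_rows):
    # Issue all inner-table lookups at once: total latency is roughly one
    # round trip instead of one round trip per row of A.
    b_rows = await asyncio.gather(*(kv_get(a["fk"]) for a in a_rows))
    return [(a, b) for a, b in zip(a_rows, b_rows)]

a_rows = [{"id": i, "fk": i % 100} for i in range(1000)]
result = asyncio.run(join(a_rows))
```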
@voidmain, thank you, that's very insightful and clear! I can see the disadvantage of implementing such a SQL layer directly through SQLite's virtual tables.
Would it be possible to build a tree DB on top of it, like MonetDB/XQuery? I always wondered why XML databases never took off; I've never seen anything else quite as powerful. Document databases du jour seem comparatively lame.
Yes you can. You basically need a tree index. Any KV store can serve as the backing data structure. I've been writing one for bidirectional transformation of config files.
MongoDB is just a database with less features than a SQL database. An XML/XQuery database is fundamentally different, so I figured if FoundationDB layers are really so powerful, they might be able to model a tree DB as well.
In some distributed databases the client just connects to some machine in the cluster and tells it what it wants to do. You pay the extra latency as it redirects these requests where they should go.
In FDB's envisioned architecture, the "client" is usually a (stateless, higher layer) database node itself! So the client encompasses the first layer of distributed database technology, connects directly to services throughout the cluster, does reads directly (1xRTT happy path) from storage replicas, etc. It simulates read-your-writes ordering within a transaction, using a pretty complex data structure. It shares a lot of code with the rest of the database.
If you wanted, you could write a "FDB API service" over the client and connect to it with a thin client, reproducing the more conventional design (but you had better have a good async RPC system!)
> but you had better have a good async RPC system!
The microservices crew with their "our database is behind a REST/Thrift/gRPC/FizzBuzzWhatnot microservice" pattern is still catching up to the significance of this statement.
This might be a dumb question (from someone used to blocking JDBC), but why is async RPC important in this case? Just trying to understand. And can gRPC not provide good async RPC?
I was referring to the trend of splitting up applications into highly distributed collections of services without addressing the fact that every point where they communicate over the network is a potential point of pathological failure (from blocking to duplicate-delivery etc). This tendency replaces highly reliable network protocols (i.e. the one you use to talk to your RDBMS) with ad hoc and frequently technically shoddy communication patterns, with minimal consideration for how it might fail in complex, distributed ways. While not always done wrong, a lot of microservice-ification efforts are quite hubristic in this area, and suffer for it over the long term.
Wouldn't layers be hard to build on the server (since you'd also have to change the client) and slow if built as a separate service?
I'm not sure what you are asking, but depending on their individual performance and security needs layers are usually either (a) libraries embedded into their clients, (b) services colocated with their clients, (c) services running in a separate tier, or (d) services co-located with fdbservers. In any of these cases they use the FoundationDB client to communicate with FoundationDB.
In case (c) or (d) how can a layer leverage the distributed facilities that FDB gives?
I mean, if I have clients that connect to a "layer service" which in turn talks to FDB, I have to manage the layer service's scalability, fault tolerance, etc. by myself.
Yes, and that's the main advantage of choosing (a) or (b). But it's not quite as hard as it sounds; since all your state is safely in fdb you "just" have to worry about load balancing a stateless service.
Got it. What would you suggest for doing something like that? A simple RPC service with a good async framework, I've read.
Like what? An RPC service on top of Twisted for Python, and similar things in other languages?
Postgres operates great as a document store, btw. You don't really need Mongo at all. And if you need to distribute because you've outgrown what you can do on a single Postgres node, you don't want to use Mongo anyway.
If you’ve read any of the comments or have been following the project, it should be pretty obvious that this is far from a rookie effort.
This is a game changer, not a hobby project. This is the first distributed data store that offers enough safety, and a good enough understanding of CAP theorem trade-offs, that it can be safely used as a primary data store.
Or perhaps it's not so incredible? Maybe it wasn't such a huge hit for Apple and didn't live up to expectations, so they figure they can give it away and earn some community goodwill.
This is great news, when I was with dynamo, FoundationDB was the other green shore for me :). They did so many things so well.
A tiny bit of caution for folks trying to run systems like this, though: it is frigging hard at any reasonable scale. The whole thing might be documented / OSS and whatnot, but very soon you are going to run into problems deep enough that they require very core knowledge to debug and real energy to dive into, both of which you probably don't want to invest your time in. Do evaluate the cloud offerings / supported offerings before spinning these up, or else ensure you have hired experts who can keep this going. They are great as a learning tool, pretty hard as an enterprise solution. I have seen the same issue a ton of times with a bunch of software (Redis/Kafka/Cassandra/Mongo...) by now. IMO In the stateful world, operating/running the damn thing is 85% of the work, 15% being active dev. (Stateless world is a little better, but still painful).
> IMO In the stateful world, operating/running the damn thing is 85% of the work, 15% being active dev. (Stateless world is a little better, but still painful).
The number of newbie engineers who see docker/kubernetes and think they can "docker run" or "helm install" a stateful service in a couple of minutes is mind-boggling.
I'll remember this quote when trying to talk sense to them.
I went to the same high school as the founders[1]. They were about the 2 best software engineers in a school with a LOT of very smart software engineers. Another pair founded Yext, which went public last year. I still consider that school the group with the highest concentration of raw brain power I've ever been a part of.
I'm probably a 1% engineer, been hired by M$, FB, and Google. These guys were light years ahead of me. I'm not sure I'm as good now as they were at like 17 years old. In fact I'm probably only a decent engineer from having observed the stuff they were doing back then and finding inspiration.
I went to UVA and I think about half of the engineering school was from TJ (including 3 of my roommates). :) I can't think of any superior public high school (and I went to a different Governor's School in Virginia myself) anywhere, due to the amazingly large and deep talent pool TJ pulls from. Nothing like it exists in the Bay Area, that's for sure!
Just checked the rankings. My first high school made #172 on the list ("International School of the Americas").
My second high school is unranked ("Texas Academy of Math and Science"). I don't think it qualifies as a high school. Seems we haven't done a great job of identifying the accomplishments of our alumni, based on the Wikipedia page. No doubt it would rank near the top though. My year alone Caltech accepted about 30 of us, more than any other high school in the country. Makes me wonder what my peers have been up to.
Anyway, I'd agree that these tech high schools have some amazingly smart people attending them.
Forgot to mention that those four engineers were same school AND same grade as me. So 4 out of 400 students were the tech guys behind Yext and FoundationDB.
Maybe it's just me, but the hubris in this comment seems a little excessive.
The aforementioned SWEs made themselves multimillionaires /and/ have great jobs /and/ get praised by their peers, yet now everyone else has to bow down to them... over claims that they were great programmers in high school? Is an appeal to your own experience the best way to make yourself seem relevant here?
It's an incredible DB system, but it ain't the Second Coming. Calm down.
I don't bow down to my laptop, but I respect what went into making it.
It's great to admire excellent engineers; aspiring to be as skilled as someone at a task can be very motivating. Worshipping them is another thing.
You're right about the inferiority complex - I know I'm a relatively bad SW engineer, but that's mostly related to how new it is to me. I expect and want to improve.
Hubris is defined as excessive pride or self-confidence. I would say that bragging that you're "probably a 1% engineer," and that you've been hired by three of the largest SW companies out there, qualifies as hubris. Maybe it's just me, but I don't think a 1% engineer would publicly boast about being one and then actually use 'M$' in a non-farcical manner.
Shitting on 99% of the SWE population to make yourself look good, then shitting on yourself to make another person look even better doesn't really work. There's a reason humanCmp() is a little more complex than strCmp().
BTW, being hired by a large company doesn't mean you're all that. Plenty of idiots get hired by Oracle.
Speaking as the original author of this monstrosity of a build system, please be careful before offering praise here. To be clear, there is a top-level, non-recursive Makefile that uses the second expansion feature of GNU make, translating Visual Studio project files into generated Makefile inputs that are transformed into targets to power the build.
Although it starts by running `make`, it's about as in-house as a thing can be.
Does the horse choke on the baseball? Is there an equine version of the Heimlich maneuver to be performed on horses suffering from mixed-metaphorical-adage-induced asphyxiation?
And then there were the hijinks we went through to build a cross-compiler with modern gcc and ancient libc (plus the steps to make sure no dependency on later glibc symbols snuck in):
I saw this in the Makefiles, and -- ah, the life of distributing proprietary Linux software. At one point in a prior job, we just replaced our build system with a wrapper Makefile that 'chroot'ed into a filesystem image that was a snapshot of one of our build machines, since it was so difficult to set up. This meant we (developers) had easier system updates, security upgrades, etc. That was just the tip of the iceberg!
Now you can "... be beautiful and terrible as the Morning and the Night! Fair as the Sea and the Sun and the Snow upon the Mountain! Dreadful as the Storm and the Lightning! Stronger than the foundations of the earth. All shall love [you] and despair!" with a build inside docker; the host can run a modern kernel; your image can run Slackware 0.1 ;-)
I'm curious what you think is wrong with gn/ninja (besides the fact that it's non-standard). My problems building Chromium come mostly from other parts of depot_tools.
We (Wavefront) have been operating petabyte-scale clusters for the last 5 years with FoundationDB (we got the source code via escrow), and we are super excited to be involved in the open-sourcing of FDB. We have operated over 50 clusters on all kinds of AWS instances, and I can talk about all the amazing things we have done with it.
We basically replaced mySQL, Zookeeper and HBase with a single KV store that supports transactions, watches, and scales. It's not a trivial point that you can just develop code against a single API (finally Java 8 CompletableFutures) and not have to set up a ton of dependencies when you are building on top of FDB. We are (obviously) experts at monitoring FoundationDB with Wavefront and we hope to release the metric harvesting libraries and template dashboards that we use to do so.
Almost 5 years in and we have not lost any data (but we have lost machines, connectivity, seen kernel panics, EBS failures, SSD failures, etc., your usual day in AWS =p).
"but we have lost machines, connectivity, seen kernel panics, EBS failures, SSD failures, etc., your usual day in AWS" <=== This. I wish more people realized that this is a day-to-day reality if you are in AWS at scale.
Basically, once a system is complex enough, some part of it is always broken. The software must be designed from the assumption that the system is never running flawlessly.
I seem to think that cloud providers are particularly opaque about small glitches (i.e. they aren't going to tell you that a router or switch was rebooted for maintenance if it comes back right away, and if you email support it's always the same response: "it's working right now") :)
I only have experience with AWS, on-prem, and high-quality colo like Equinix. Possibly due to reduced complexity and having full control over the networking setup, we saw significantly fewer issues on the latter vs. AWS.
Already playing with it. FoundationDB is used for production petabyte-scale deployments, and the whole deterministic-simulation approach to testing is really reassuring as far as bugs/stability go. I'm guessing that with Apple's resources that approach was taken to a whole new level after the acquisition?
I can see everyone's extremely happy about this, which is great. As someone who's never used it, I'd like to know more about FoundationDB and how it compares to other offerings such as MySQL or Postgres, and which use cases is it most suited to. I would especially love to hear the thoughts of those with direct experience of using Foundation DB. Thanks!
Seconding this. Instead of hearing from bloggers or contributors, what is using this DB like in the trenches? What simple use case would fit this DB perfectly? Is there a lot of setup on second/third deploy? Easy to maintain, or requires a lot of tweaking/tuning? How is memory usage with 10M records? 100M?
It's under Apache 2.0 for those curious: https://github.com/apple/foundationdb/blob/master/LICENSE. Also, side note: it looks like this was a private GitHub repository for at least a couple months, since they have pull requests going back for at least that long. I find this interesting, since Apple normally "cleans up" history before open sourcing.
I hadn't heard of FoundationDB before, so I did some digging into the features: https://apple.github.io/foundationdb/features.html . It seems to claim ACID transactions with serializable isolation, but also says later on that it uses MVCC, slower clients won't slow down operations, and that it allows true interactive queries. I didn't think an MVCC implementation could provide that level of isolation, and I'm not even sure how you provide that level of isolation and those other guarantees with any implementation, am I misunderstanding something?
I'll try to give you a quick introduction. The architecture talk I recorded for new engineers working on the product ran to four or five hours, I think :-). In short, it is serializable optimistic MVCC concurrency.
A FDB transaction roughly works like this, from the client's perspective:
1. Ask the distributed database for an appropriate (externally consistent) read version for the transaction
2. Do reads from a consistent MVCC snapshot at that read version. No matter what other activity is happening you see an unchanging snapshot of the database. Keep track of what (ranges of) data you have read
3. Keep track of the writes you would like to do locally.
4. If you read something that you have written in the same transaction, use the write to satisfy the read, providing the illusion of ordering within the transaction
5. When and if you decide to commit the transaction, send the read version, the list of ranges read, and the writes you would like to do to the distributed database.
6. The distributed database assigns a write version to the transaction and determines if, between the read and write versions, any other transaction wrote anything that this transaction read. If so there is a conflict and this transaction is aborted (the writes are simply not performed). If not then all the writes happen atomically.
7. When the transaction is sufficiently durable the database tells the client and the client can consider the transaction committed (from an external consistency standpoint)
The implementations of 1 and 6 are not trivial, of course :-)
So a sufficiently "slow client" doing a read write transaction in a database with lots of contention might wind up retrying its own transaction indefinitely, but it can't stop other readers or writers from making progress.
It's still the case that if you want great performance overall you want to minimize conflicts between transactions!
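As a rough illustration, here is a toy, single-process Python model of the optimistic protocol above (all names are made up for illustration; the real system distributes steps 1 and 6 across many machines and is vastly more sophisticated):

```python
# Toy single-process model of FDB-style optimistic MVCC commits.
# Purely illustrative: the real implementation is distributed.

class VersionedStore:
    def __init__(self):
        self.version = 0
        self.history = {}  # key -> list of (version, value)

    def read_version(self):                # step 1: hand out a read version
        return self.version

    def read(self, key, at_version):       # step 2: consistent snapshot read
        for v, val in reversed(self.history.get(key, [])):
            if v <= at_version:
                return val
        return None

    def commit(self, read_version, reads, writes):   # steps 5-6
        # Conflict check: did any committed transaction write a key
        # we read, after our read version?
        for key in reads:
            for v, _ in self.history.get(key, []):
                if v > read_version:
                    return False           # conflict -> abort
        self.version += 1                  # assign the write version
        for key, val in writes.items():
            self.history.setdefault(key, []).append((self.version, val))
        return True                        # all writes applied atomically

class Transaction:                         # client-side state (steps 2-4)
    def __init__(self, store):
        self.store = store
        self.rv = store.read_version()
        self.reads = set()                 # keys read (ranges in real FDB)
        self.writes = {}                   # writes buffered locally

    def get(self, key):
        if key in self.writes:             # step 4: read your own writes
            return self.writes[key]
        self.reads.add(key)
        return self.store.read(key, self.rv)

    def set(self, key, val):               # step 3: buffer the write
        self.writes[key] = val

    def commit(self):
        return self.store.commit(self.rv, self.reads, self.writes)
```

With two concurrent transactions touching the same key, the later committer sees the conflict and aborts, while the other proceeds, mirroring the "retry on conflict" behavior described above.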
This is a good explanation of how it happens on a single node. What do you do when the transaction is distributed? How do you achieve consensus? Is there a write up on it anywhere?
The only thing that's different in a distributed cluster is the implementations of steps 1 and 6. As voidmain said, the details of that are not trivial, ESPECIALLY the details of how it never produces wrong answers during fault conditions.
I don't know that there's been an exhaustive writeup of that part, but maybe one of us or somebody on the Apple team will put something together. It probably won't fit in an HN comment though!
Or... maybe this is the part where I point out that the product is now open-source, and invite you to read the (mostly very well commented) code. :-)
Going forward as an opensource product, I hope to see some clarity on the "how it works"... Distributed, performant ACID sounds good, almost too good to be true. Not that I doubt it at the moment, I just want to understand it better :)
Thanks. Can you elaborate on how 6 is actually accomplished? Various earlier comments have hinted that the transactional authority (conflict checking) can actually scale 'horizontally' beyond the check throughput that can be achieved by a single node. Is that the case? And what's the magic sauce for doing that for multi-object transactions? :)
Yes, conflict resolution is for most workloads a pretty small fraction of total resource use so you usually don't need a ton of resolvers (I think out of the box it still comes configured with just one?), but it can scale conflict resolution horizontally.
The basic approach isn't super hard to understand, though the details are tricky. The resolvers partition the keyspace; a write ordering is imposed on transactions and then the conflict ranges of each transaction are divided among the resolvers; each resolver returns whether each transaction conflicts and transactions are aborted if there are any conflicts.
(In general the resolution is sound, but not exact - it is possible for a transaction C to be aborted because it conflicts with another transaction B, but transaction B is also aborted because it conflicts with A (on another resolver), so C "could have" been committed. When Alec Grieser was an intern at FoundationDB he did some simulations showing that in horrible worst cases this inaccuracy could significantly hurt performance. But in practice I don't think there have been a lot of complaints about it.)
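The partitioned scheme, including the inaccuracy described above, can be sketched in a few lines of Python (a toy model with made-up names, not the real resolver code):

```python
# Toy model of range-partitioned conflict resolution. Each resolver owns
# part of the keyspace and checks, in commit order, whether a transaction's
# reads conflict with an earlier transaction's writes in its range --
# WITHOUT knowing whether that earlier transaction was aborted elsewhere.

def resolve(batch, partitions):
    # batch: list of (name, reads, writes) in imposed commit order
    # partitions: one key-ownership predicate per resolver
    verdicts = {name: True for name, _, _ in batch}
    for owns in partitions:
        written = set()                  # writes this resolver has accepted
        for name, reads, writes in batch:
            if any(owns(k) and k in written for k in reads):
                verdicts[name] = False   # conflict in this resolver's range
                continue                 # locally-aborted writes don't count
            written |= {k for k in writes if owns(k)}
    return verdicts
```

Running it on the A/B/C chain from the parenthetical above shows the false abort: resolver 1 aborts B (it read "a", which A wrote), but resolver 2 doesn't know that, still counts B's write to "n", and therefore aborts C, even though C "could have" been committed.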
Of course. FDB thinks about read and write conflict ranges, which are functions of the keys, not the values (or lack thereof). A read of a non-existent key conflicts with a write to that key. A read of a range of keys conflicts with a write to a key in that range, even if that key did not have an associated value at the time of the original read.
> I didn't think an MVCC implementation could provide that level of isolation
MVCC needs a bit of additional logic on top to be serializable; "serializable snapshot isolation" is a good keyword to search for. But it's definitely possible.
A lot of people apparently hated that I attacked this person, and downvoted me; but what the person who posted this comment was doing was intentionally throwing shade at and casting doubt on a project by saying "I think that the claims on this webpage don't make any sense, as this should be impossible". It comes off as "oh come on, this doesn't even sound plausible; I call bullshit".
This then causes people who aren't versed with the product or the technology to decrease their perception of the product, and puts the team behind it in a position of having to not just come to its defense but to do so quickly due to the perception concerns.
Meanwhile, if they just did a basic search for "MVCC serializable" they would find that they were wrong; which means that it took more time to leave this insulting comment than it would have taken them to learn how this can work.
As a community, we really, really need to beat down on casual cynics like this, who like to lazily "call bullshit" or play the "citation needed" card as a way to undermine the credibility of other peoples' products. We live in a future where the answers to these kinds of doubts are a moment away: this particular form of debate tactic needs to die.
I absolutely was not throwing shade at the project, I was just trying to understand. I have a great interest in distributed systems that can efficiently implement serializable ACID transactions, and looked at their documentation with great interest to see how they implemented it. I honestly thought that MVCC could only efficiently implement Snapshot Isolation. So I figured my understanding was wrong, or at worst there's a typo in their documentation and they use something other than MVCC under the hood. Either way, I wanted to know more about their architecture.
I'll take your feedback under consideration, and I'm sorry to have irritated you.
I'm very interested in hearing more about what running FoundationDB in production is like.
I believe that FoundationDB stores rows in lexicographical order by key. Other databases like Cassandra strongly push you toward not storing data this way as it can easily lead to hotspots in the cluster. How do you deploy a FoundationDB cluster without leading to hotspots, or perhaps what operational actions are available to rebalance data?
Cassandra allows you to configure a "partitioner" that determines which nodes a primary key belongs to. There is a ByteOrderedPartitioner that stores partitions in order lexicographically by primary key. There is also a Murmur3 hash-based partitioner (which is the recommended default).
Cassandra allows you to store multiple records in sorted order within a partition. The normal recommended way to get data locality is to store records that are frequently accessed together in the same partition.
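FDB itself shards key ranges and rebalances them automatically, but a layer can also shape its keys to spread a hot sequential workload. One common trick (purely a layer-level key design with an assumed bucket count, not an FDB feature) is a small hash-bucket prefix per series:

```python
# Sketch: spreading monotonically increasing keys (e.g. timestamps)
# across N buckets so new writes don't all land on one shard, while
# keeping each series' data contiguous and range-readable.

import hashlib

NUM_BUCKETS = 8  # assumption; tune to cluster size

def bucketed_key(series_id, timestamp):
    # Stable bucket per series: all of one series' points stay adjacent
    # within the bucket, so per-series range reads still work in order.
    bucket = int(hashlib.md5(series_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    return (bucket, series_id, timestamp)

def series_range(series_id):
    # The key range to scan for one series; the bucket is derivable
    # from the series id, so no fan-out across all buckets is needed.
    bucket = int(hashlib.md5(series_id.encode()).hexdigest(), 16) % NUM_BUCKETS
    return (bucket, series_id, 0), (bucket, series_id, float("inf"))
```

Different series hash to different buckets, so write load spreads across the keyspace even though timestamps within each series are strictly increasing.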
I am not a database expert by any means but have been curious about distributed data systems and had not heard of FoundationDB till now and was very excited to read about it.
On reading through the documentation, I encountered a section on "Known Limitations"[1] which stated that keys could not be larger than 10kb and values cannot be larger than 100kb. This seems to be a major limitation. Am I missing something or is this strictly for storing text?
Because the data model is ordered, large blobs can and normally should be mapped to a bunch of adjacent keys and read with a range read, not a single huge value. That also allows you to read or write just part of one efficiently.
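A minimal sketch of that chunking pattern, with a plain dict standing in for the ordered key/value store (the chunk size and key layout here are assumptions, not an FDB convention):

```python
# Sketch: storing a large blob as adjacent keys so it can be read back
# with a single range read and written/updated in parts.

CHUNK = 100_000  # stay under the ~100 kB value limit

def write_blob(kv, name, data):
    for i in range(0, max(len(data), 1), CHUNK):
        # Fixed-width chunk index keeps the keys lexicographically ordered.
        kv[(name, "%010d" % (i // CHUNK))] = data[i:i + CHUNK]

def read_blob(kv, name):
    # Emulates a range read over the blob's key prefix.
    chunks = sorted(k for k in kv if k[0] == name)
    return b"".join(kv[k] for k in chunks)
```

Updating one chunk touches only that key, which is how partial reads and writes of a large value stay efficient.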
Worked a job once where that was the underlying data store.
I was only allowed to touch the SQL interface to it, which was....weird.
The SQL dialect was ancient, felt like something from about 1990 (and this was in.... 2012 or so, so not THAT long ago).
Query performance seemed invariant. A simple select * from foo where id=X and a monster programmatically generated join across 15 tables would both take about 1.5 seconds to return results.
Yeah, when I read about that I thought it sounded neat—make sure the index updates with the data. On the other hand, I think of CAP theorem as an iron triangle. If you are gaining consistency, what’s the trade off?
This _is_ the usual trade off, but what makes FoundationDB so crazy is that it's a CP system that has a performance profile that AP systems would have a hard time matching.
Wow, this is some really exciting news! I think it would be amazing to create a GraphQL API for FoundationDB, so I have created a feature request for this in the Prisma repo. For those who don't know, Prisma is a GraphQL database mapper for various databases.
How's this compare to https://github.com/pingcap/tikv? It's a relatively new distributed KV store written in Rust that also is transactional and backs the new TiDB database.
Looks promising, but does anyone know if FoundationDB has external events or triggers, similar to Firebase or RethinkDB? I can't seem to find much on it.
If not, then a lot of potential is being left on the table, because usage would require wrapping FoundationDB in a proxy or middleware of some kind to synthesize events, which can be extremely difficult to get right (due to race conditions, atomicity issues, etc). Without events, apps find themselves polling or rolling their own pub/sub metaphor over and over again. If anyone with sway is reading this, events are very high on the priority list for me. Thanks!
In addition to single-key asynchronous watches, there are also versionstamped ops (for maintaining your own, sophisticated log in a layer) and configurable key range transaction logging (but see the caveats in my other post on the topic).
I'm not sure it has every feature it will ever need in this area, but it's a pretty good starting point for building "reactive" stuff.
Your response is both very exciting and slightly intimidating! Would love to see a “NewSQL” and/or event store built on top of FDB tech. Would the key-range logging and versionstamped ops be capable of providing/emulating a performant “atomic broadcast” similar to that in Kafka?
Versionstamped operations and transaction logging are fully transactional. Watches are asynchronous: they are used to optimize a polling loop that would "work" without them.
That's from sqlite. (Awesome tech, awesome license).
Related snippet from the "Distinctive Features Of SQLite" page[1] from the sqlite project:
The source code files for other SQL database engines typically begin with a comment describing your legal rights to view and copy that file. The SQLite source code contains no license since it is not governed by copyright. Instead of a license, the SQLite source code offers a blessing:
May you do good and not evil
May you find forgiveness for yourself and forgive others
May you share freely, never taking more than you give.
The storage engine is, and always was, a fairly heavily modified asynchronous version of sqlite's btree. It's been extremely reliable, which was always our top priority, and the performance isn't bad. But honestly, when there was a problem with it, our development velocity in improving it wasn't great.
It's super easily pluggable[1], so now that it is open source people can experiment with other engines. I think there is a lot of room for improvement. Also architecturally it's designed in anticipation of being able to run different storage engines for different key ranges and for different replicas. For example, you might keep one replica in a btree on SSD (for random reads) and two on spinning disks in a log structured engine.
It looks to me like Apple has made a pretty complete release of the key/value store. What's missing is
(1) Layers! Everything from relational databases to full text search engines to message queues
(2) Monitoring stuff. Unsurprisingly it doesn't look like we have the tools for monitoring log files, etc. Wavefront (also a major user!) is a great commercial solution, but there should be something OSS
Truthfully, at Wavefront we've taken the json status directly into telegraf. Plus a bunch of python tooling to massage additional telemetry on a cluster's health (coordinator reachability, for example).
Plus even more tooling (mostly Ansible) for managing large fleets.
How does this stack up against HBase and Cassandra, which seem to have gotten traction already in the same areas that FoundationDB seems best suited for?
HBase and Cassandra both provide much weaker guarantees. At best they can support compare-and-set operations that are local to a single region/"row", whereas FoundationDB lets you do optimistic transactions (and consistent read snapshots) on the entire database. And at least in cassandra even getting that level of consistency comes at a big performance cost.
The benefit of ACID transactions isn't just safety at the application level, it's the ability to compose abstractions and build complex data models on top of simple ones. For example, indexes in FoundationDB are typically more scalable than in the majority of other systems, where index queries often have to be broadcast to all the systems in a cluster. Yet FoundationDB doesn't even have indexes as a feature - higher layers build and maintain index invariants using transactions.
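For instance, a layer can keep a secondary index consistent simply by writing the record and its index entry in the same transaction. A toy sketch, with a dict buffer standing in for a real FDB transaction (all names here are hypothetical):

```python
# Toy sketch of layer-maintained indexing: the record write and the
# index write go into one transaction buffer, so after an atomic
# commit the index invariant always holds.

def set_user(tr, user_id, email, old_email=None):
    # tr is a dict-like transaction buffer; commit is all-or-nothing.
    tr[("user", user_id)] = email
    if old_email is not None:
        tr[("email_index", old_email)] = None   # tombstone the old entry
    tr[("email_index", email)] = user_id        # point index at the record

def commit(store, tr):
    # Stand-in for an atomic FDB commit: apply all buffered writes.
    for k, v in tr.items():
        if v is None:
            store.pop(k, None)                  # a tombstone clears the key
        else:
            store[k] = v

def lookup_by_email(store, email):
    return store.get(("email_index", email))
```

Because both writes commit (or abort) together, a reader can never observe a record without its index entry or a stale index entry pointing at a changed record.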
FoundationDB is also just really reliable and fault tolerant. Its testing story is drastically better than what the teams building these other products are doing.
An interesting fact is that Apple is probably one of the biggest users, if not the biggest user, of Cassandra out there.
Can't speak to HBase, but one thing Cassandra doesn't guarantee is ACID - I've seen some data consistency issues that have arisen from Cassandra in our usage, although it hasn't been a huge problem for us. That difference alone probably brings a lot of value to FoundationDB.
My understanding of hbase (which could be wrong) is that writes of a particular key always go to the region master first, so if you read from the master you always get the latest value of a key. The tricky part is that when the region master goes offline another one needs to take its place, and you can get inconsistency or unavailability depending on how it is set up.
I’d like to see a deep dive of how FoundationDB handles this. It has to trade off consistency for availability at some point, and it would be nice to know exactly where.
By default in HBase, each region is served by a single region server. Writes and reads always go to that regionserver, and are consistent across a single row (HBase is a CP system). In this mode, when a regionserver goes offline its regions are (eventually) reassigned to a different region server. During this process the region is unavailable, but there's no loss of consistency.
Hbase 0.98 (I think?) introduced a feature called "timeline consistency" that allows reads from replica regionservers. This has to be enabled for a table and has to be specified on the query side. If it is, you have the option of falling back to the replica if the primary doesn't respond within a deadline. This may be a good tradeoff if you value availability over consistency.
The core product is a distributed, highly fault tolerant ordered Key Value store with true serializable ACID transactions. All of the layers (including document, graph) sit on top of that and inherit its ACID properties, scalability, fault tolerance, etc. It doesn't appear to me that they released any of the top level layers, but those are MUCH simpler to build, and that's where the OS community can step in.
FoundationDB at its core is not a graph database. You could build a graph database on top of it, using FoundationDB as a very strong and feature rich storage engine, however you'd like. It would be much simpler to do than building a new (especially a distributed) graph database from scratch.
You can think of FDB as a distributed storage engine. It has the same low level data model as the engines you mention, but has distributed transactions, fault tolerance, automatic data partitioning, operational tooling, etc built in. So if you build an "application level database" or library on top of it, it is automatically a distributed database.
Yes, but with an important caveat: Getting ACID transactions to work correctly across multiple machines is the really hard part. FDB is providing a KV store (across multiple machines) kind of like RocksDB does (on one machine), but it's also solving that really hard problem for you. That puts it in kind of a different league.
Well, it's infrastructure. Moreover, it's infrastructure for infrastructure! So if that sounds super boring, you don't have to be excited :-)
But I would tell the story something like this: state storage is the root of (almost) all operational evil. It's very easy to make a system reliable if it's totally stateless. Even most bugs can be lived with if the worst you have to do is restart a service and carry on! But to do anything interesting you have to store state somewhere, and you have to modify that state concurrently without screwing it up.
And the many challenges of operating stateful systems are greatly multiplied if you have a lot of different ones. For example, if you have a datacenter outage and some but not all of your stateful systems deal with it correctly, probably your application as a whole is still down.
So as one more stateful system, does FoundationDB just make that worse? Well, FoundationDB is designed specifically to be a foundation for many very different stateful systems - not just different kinds of databases but things like search engines or message queues that you normally don't think of in the same category. So that almost any system can map to it efficiently, it has a lowest common denominator data model (key/value) and the highest possible guarantees in terms of concurrency control. And you can run diverse systems supporting an application on the same FoundationDB cluster, or on different clusters with the same exact operational requirements.
A few users of FoundationDB have been able to get the benefits of this vision, consolidating lots of different stuff into a single, operationally desirable system. But for more people to be able to, not only does the key/value store have to be available to them, but lots of stuff also has to be built on top of it. By releasing FoundationDB under a very liberal open source license, Apple has hopefully made that possible. In the long run, hopefully it will make all server-side computing more reliable.
Also, it's a really good key/value store, if you happen to need one of those!
I noticed that all the write benchmarks in https://apple.github.io/foundationdb/benchmarking.html are for random writes. Is write throughput affected by highly-sequential writes (e.g. - time series) vs random writes? How do you avoid hot-spotting on recent ranges?
How efficient are range deletes?
On https://apple.github.io/foundationdb/performance.html I read "The memory engine is optimized for datasets that entirely fit in memory, with secondary storage used for durable writes but not reads." I'd like some clarification:
(1) Which memory does "entirely fit in memory" refer to? A single machine? Or SingleNodeMemory * Nodes / ReplicationFactor?
(2) If only recently-written data is likely to be queried, and all recently-written data fits entirely in memory, is that sufficient? If so, would an unexpected query of old data cause a huge impact on write throughput?
(3) What is the structure/format of the data stored on disk? How is it updated?
I'm wondering how well this could be used for time series data. I saw mention here that wavefront uses FoundationDB for this, but would like more details if any are available.
Sequential writes should be a little faster than random at the individual storage node level, but if your entire write workload is a single ordered log scalability will suffer. It might be theoretically possible for fdb to scale in this situation by creating shards on the fly during transaction processing, but no one has seriously tried to make that work.
You can mitigate by designing your key structure/data ordering to not have that property.
The memory engine requires your data to fit in memory (total across all your nodes, after replication). It writes interleaved snapshots and updates to disk, and reads the whole dataset back into memory when restarted.
You can do great modeling of time series data in FDB, though it will take some care and thought.
You should ask these questions on the forum. This article is falling off HN, I am going to lose track of it, and it doesn't look like the Apple team is answering questions here.
This is wonderful news. I built a proof-of-concept realtime collaborative editor on top of foundationdb a few years ago, and was very disappointed when I couldn't use it in production. I'm really excited to use this in some projects I'm working on.
Quick question: I know there's a watch API, but is there any way to subscribe to a change feed from foundationdb? I'd like to consume the FDB event log to do external indexing & map-reduce work.
Yes. I don't know how well documented it is, but there is an API (well, system keyspace) that can configure the database to log transactions for a selected key range (up to and including the whole database) into another selected key range. It is used by backup and asynchronous replication tools. The format of the configuration keys and transaction logs should be considered less stable than the core key/value API, which basically never breaks backward compatibility, so by using it you are shouldering a maintenance burden to keep up with e.g. changes in the log format and new types of mutations when new versions of FoundationDB come along. So it shouldn't be used too casually. But it's there.
Alternatively, your application or layer can use the "versionstamp" atomic operations to write its own ordered log of what it is doing, or other indexing tricks. Depending on your data model this might be able to be much more efficient. For example, for external indexing you probably don't need to preserve a history of prior values but only be able to identify all the values that have changed. This can be done with a very simple and compact index that doesn't need to duplicate all the data to be indexed.
> Alternatively, your application or layer can use the "versionstamp" atomic operations to write its own ordered log of what it is doing, or other indexing tricks.
I'm not sure I understand.
Are you suggesting having a second key space at `ops/{VERSIONSTAMP}` or something where values contain enough information about the operation to be able to process changes in an indexer? The indexer could then clean up after itself, deleting the operations once they had been ingested? ... Effectively using a portion of the keyspace as a queue?
Yes. Or if it's not important for the indexer to process things chronologically, you could just have an index of the primary key (only) of records that haven't been indexed.
If you are trying to make your external index MVCC, then you will want to carry some version information too.
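The compact changed-keys index described above can be sketched concretely. This is a simulation, not real FDB code: a plain dict stands in for a transaction's key/value view, and the subspace names (`records`, `dirty`) are invented for illustration. A real layer would run each method inside `@fdb.transactional` over tuple-encoded subspaces so the record write and the dirty marker commit atomically.

```python
class Store:
    """Toy sketch of a 'dirty keys' index: writers mark changed primary
    keys, the indexer drains the markers. A dict stands in for an FDB
    transaction's view of the keyspace."""

    def __init__(self):
        self.kv = {}

    def write_record(self, pk, value):
        # Write the record and, in the same (simulated) transaction,
        # mark it dirty. Presence-only marker: no data is duplicated.
        self.kv[("records", pk)] = value
        self.kv[("dirty", pk)] = b""

    def drain_dirty(self):
        # The indexer scans the dirty subspace, reads each changed
        # record, and clears the marker. Records touched again later
        # simply get re-marked.
        dirty = sorted(k for k in list(self.kv) if k[0] == "dirty")
        changed = []
        for k in dirty:
            pk = k[1]
            changed.append((pk, self.kv[("records", pk)]))
            del self.kv[k]
        return changed
```

Note that two writes to the same key before the indexer runs leave only one marker, which is exactly why this stays compact compared to logging every mutation.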
This kind of question might be better served by the new community forum you can get to from the website!
It's closest to TiDB's key-value layer; a building block for more complex systems. More traditional, monolithic databases like CockroachDB (SQL) or FaunaDB (NoSQL) trade off extensibility for the benefits in performance and operations that come from very tight coupling.
In my understanding, FoundationDB's transaction management is closest to FaunaDB's; read/write sets are linearized in memory in preprocessing nodes and distributed asynchronously to the replicas rather than locked on the replica leaders like Spanner or CockroachDB. This is why FoundationDB doesn't support long-lived transactions.
It's interesting that the FoundationDB team chose to unwind their service architecture (there used to be separate transaction manager and replica processes), I assume in the interests of ease of operations.
It is not clear to me how leader election and failover works for the transaction management role. Maybe somebody from the team can clarify.
No, it's always been possible to have as little as one fdbserver process and have a complete key/value store. Internally it is "microservices" though - it will start a "proxy", a "resolver", a "log", a "storage", etc within that one process.
I mean it in terms of monolithic process vs. service-oriented architecture, distinct from a distributed vs. centralized operational topology.
FaunaDB and CockroachDB are implemented as monolithic processes and can break encapsulation boundaries for performance reasons. For example, FaunaDB does aggressive predicate pushdown to accelerate intersections and joins, which you cannot do if you have to conform to a key/value interface exclusively. It can also eliminate all network overhead for query data that's local to the processing node.
I understand how that terminology is confusing though...how would you explain it?
I think the use of monolithic here is that CockroachDB is higher level than FoundationDB. The project was not designed to be your do-everything-DB that you layer different systems on, but has a specific goal and is intended to be used directly.
FoundationDB was a company that existed in the market, licensing our database technology for quite a while. Licenses don't necessarily terminate upon acquisitions.
I had always assumed they acquired the team more than the ip. I'm not sure this confirms that or not. I'd be curious what the answer to this question is as well, but wouldn't be surprised if the answer was "they don't"
Question for the FoundationDB gurus that are hanging out on this thread: How well does it deal with spotty connectivity? I'm asking because I work on mobile robots, and WiFi and/or LTE connections are always coming and going in unpredictable ways as the vehicle moves about its environment. Reconnecting every few minutes is normal.
You should have seen their trade show demos, where they'd demonstrate fault tolerance by literally powering down random computers and unplugging network cables, while visualizing how the system managed fault tolerance.
The fault tolerance is pretty much flawless. You won't be able to get the database "stuck" or see anomalies.
But performance is going to suck if you run server nodes over unreliable connections. I have trouble seeing a FoundationDB cluster running on mobile robots as more than a trade show gimmick. Albeit an awesome gimmick. So in summary you should totally do that.
I can see that. There are various vectors to performance. I hear you saying transactions per second would be unimpressive.
A couple of less time-sensitive applications are: 1. distributing information the entire fleet should eventually know, 2. event log aggregation with fine-grained time alignment among nodes.
Both are probably silly problems to solve with a database, killing houseflies with sledgehammers and all that, but it never hurts to explore creative tool misuse :)
Spanner (and to an extent its less mature OSS descendants Cockroach and TiKV) has more comparable goals, but is fairly different architecturally. For example, FoundationDB only requires N+1 replicas instead of 2N+1 to achieve N failure tolerance (even lots of databases with much weaker guarantees are in the latter category!), doesn't trust clocks at all, doesn't lose performance when transactions cross replica sets, and uses optimistic instead of pessimistic concurrency.
Also FoundationDB (and TiKV) make a distributed, transactional key/value store available as an API, while Spanner and Cockroach expose only a relational database layer. FoundationDB is designed philosophically with the idea that you want to have a single storage layer to manage operationally but should be able to mix and match data models and query engines above that layer.
On the other hand, FoundationDB doesn't currently have any full fledged high level database layer available. Someone will probably dig up our SQL layer (which was AGPL, I think) but I wouldn't really recommend using it in production because there is no active development team. Someone will probably try porting the SQL layers from TiDB and Cockroach.
Maybe Apple will open source more stuff in the future, but let's not get too greedy!
The datacenter-aware mode documentation [0] says “Although data will always be triple replicated in this mode, it may not be replicated across all datacenters.”
I think it's just saying that it's willing to place two of the three replicas in a datacenter, for example if one of the three datacenters is down. This has downsides, since losing a datacenter will make it aggressively fill up disks, but mitigates against subsequent failures causing data loss.
Most of the people who have run FoundationDB at scale have, for performance reasons, used configurations other than the "datacenter aware" mode for their inter-region replication, so that mode may not be the most operationally battle-tested.
There is some work that from what I can see in the code is still in progress to build a new, almost magical inter-region replication mode that I am very excited about, which combines synchronous replication to a "satellite" datacenter within region with asynchronous replication between regions and recovery logic that will finish replication and fail over in case of a partial failure of a region. You get fast transaction commits (much less than the inter region ping time), can fail over to a secondary region automatically and safely (without losing any committed transactions) in the vast majority of circumstances, and in the worst case you can (manually, because you are accepting data loss!) give up very recently committed transactions to fail over.
The satellite mode that I described is an active/passive mode. One region is accepting reads and writes; the other is just replicating everything. When it looks like the active region is in trouble, the asynchronous replication is "finished up" before switching over to the other region. The multiple datacenters in each region ensure that usually a regional failure will be "slow enough" that this automatic process (which after all only takes hundreds of milliseconds to seconds) can usually complete before a region goes away. And this will be handled pretty transparently by the datastore.
If a region is blown up instantly by an orbital laser cannon, then the database will go down and you will have to manually tell it to recover ACI in the other region, sacrificing the durability of whatever committed transactions in the lost region were destroyed by the laser cannon.
Well, if you want ACID then you are going to have to pay for at least one geographic round trip per committed transaction. (So why not go active/passive, and have at least one of your datacenters be fast?)
But what if you have different pieces of data and you want them to be fast in different datacenters? I think a great solution to this can be layered on top of multiple FoundationDB clusters, each using the satellite mode, but this is one thing that I at least haven't been able to think of a way to provide properly at the data model agnostic key/value store layer - the details about what to put where seem fundamentally dependent on your data model.
> So why not go active/passive, and have at least one of your datacenters be fast?
While local writes would stay fast, wouldn’t active/passive see higher-latency non-local writes than Spanner or Fauna’s (assuming a NAM-EUR-ASIA topology)?
I agree with and do appreciate the multiple FoundationDB clusters suggestion.
I'm speculating, but I think in this mode, from the "slow" datacenters you would see one round trip time to start a transaction, then reads will be fast (they can be done safely from your local datacenter because of MVCC), and then one round trip time to commit the transaction. I think that's as good as Spanner does with the same geography, but I'm not sure. I think you could get rid of the first round trip time even without any clock synchronization nonsense, by speculating on a read version for read/write transactions. And 1xRTT is obviously as fast as physically possible for ACID.
Update: Apparently Spanner is way slower than I thought in "slow" datacenters, doing a round trip for every transactional read. So this would absolutely stomp that. (Although spanner, as a higher level system, has the ability to move different data to be fast in different regions built in, which is nice, and as I said it will probably have to be a layer feature in the fdb world)
Wait how can it be ACID if it can tolerate N failures from N+1 nodes? Doesn't that kill consistency almost by definition? What level of isolation does it support? Snapshot? Serializable?
It is serializable and totally uncompromising. Philosophically pretty much everything defaults to the safest possible thing.
It can't tolerate N failures from N+1 nodes. It can tolerate N failures with N+1 copies of your data. In a big cluster you have plenty of nodes but storing everything 5 times to tolerate 2 failures is really expensive.
> It can't tolerate N failures from N+1 copies of your data
Sorry I got the terminology wrong, but that's a distinction without a difference. If it can tolerate N failures from N+1 copies, that means a network partition would allow any one copy to continue chugging along making changes by itself. You have two options: consistency is dropped and you downgrade to eventually consistent (at best), or availability is dropped meaning a single node can't make changes without a majority, which invalidates the N of N+1 failures claim. (Which is where the N failures of 2N+1 copies claim comes from in the first place: after N failures you still have a majority of copies.)
Or there's some other magic quorum protocol I've never heard of that makes the majority problem disappear.
FoundationDB stores 2N+1 copies of some "coordination state" and does a consensus algorithm whenever it is updated. But this state doesn't contain a copy of your data; basically think of it as storing a replication configuration. It's very small and rarely changes.
In the happy case, replication takes place using the replicas and quorum rules specified by this configuration. For example, you might require writes to succeed synchronously against all N+1 replicas of some transaction log. After N failures, there will still be 1 replica remaining with the latest transactions. But in order to proceed after any failures, you have to do a consensus transaction against a majority of replicas of the coordination state, to specify the new set of N+1 replicas you will be using. And you also make sure that the 1 replica you are recovering from knows you are doing it, so that it won't continue to accept writes under the old replication configuration.
There can't be two partitions capable of committing transactions, because (in this case) you need either
(a) All N+1 replicas of the log, so that you can commit synchronously, or
(b) A majority (N+1 out of 2N+1) of the replicas of the coordination state, AND 1 replica of the log
Sorry if this isn't a great explanation. Anyway it does work. I expect that you could rephrase this as an optimization of a consensus protocol, though I think it would be hard to build a performant and realistically featureful implementation that way.
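The no-split-brain argument above can be checked mechanically: enumerate every possible commit quorum and recovery quorum and verify that each pair overlaps. This is only a sketch of the quorum arithmetic (with N = 2 and made-up replica names), not how FoundationDB is actually implemented:

```python
from itertools import combinations

def quorums_always_intersect(quorums_a, quorums_b):
    """True if every quorum in the first set shares a member with
    every quorum in the second set."""
    return all(set(a) & set(b) for a in quorums_a for b in quorums_b)

N = 2
logs = [f"log{i}" for i in range(N + 1)]          # N+1 log replicas
coords = [f"coord{i}" for i in range(2 * N + 1)]  # 2N+1 coordinators

# (a) Committing requires ALL N+1 log replicas.
commit_quorums = [tuple(logs)]

# (b) Recovering requires a majority (N+1 of 2N+1) of the
# coordinators AND at least 1 log replica.
recovery_quorums = [c + (l,)
                    for c in combinations(coords, N + 1)
                    for l in logs]

# Every commit overlaps every recovery in at least one log replica,
# and any two recoveries overlap in a coordinator majority, so two
# disjoint partitions can never both make progress.
assert quorums_always_intersect(commit_quorums, recovery_quorums)
assert quorums_always_intersect(recovery_quorums, recovery_quorums)
```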
When I wrote that I was wondering if it used a second 2N+1 dataset just for coordination & consensus. This has the benefit of separating data from consensus, allowing the N of N+1 data failure tolerance. But at the end of the day consistency still comes down to an N of 2N+1 failure tolerance for that second coordination state. It's smaller and easier to replicate, etc., but it seems like it still has the same fault tolerance as just replicating the data 2N+1 times. It sounds like it's worked out great in practice for FDB.
But you say it rarely changes... wouldn't it have to change every time there's a change to the dataset? I feel like this means you have to do even more replication and consensus than just replicating the data without this second consensus state.
You only have to write to the coordination state when there is a failure. You can commit millions of transactions in the happy case without ever doing such a write. And failure detector performance and other engineering concerns are usually more of a limitation, in practice, on the performance of recovery than the latency of the coordination state consensus, even when the coordinators are geographically distributed.
So the strategy is to optimistically assume that there are no failures and just replicate to all N+1 copies. If there's a failure then back off to the consensus state to coordinate the fix rigorously.
In the best case with no failures this works great. But as the number of failures increases, I feel like due to the extra synchronization there will be an inflection point where the cost of the extra layers of coordination will be higher than just synchronizing the data directly. But due to 'other concerns' that inflection point is pushed back by a lot.
If you expect to have lots of (hopefully very temporary!) node failures, I think FoundationDB has another trick up its sleeve. You can store (say) N+2 replicas of transaction logs, which are also relatively small and (since sequential) efficient. Then you have a write quorum of N+1 and a recovery quorum of 2 logs, and you don't have to do coordination on every failure.
It's certainly true that with enough failures you aren't going to make much progress. I'm not sure that is any less true with plain old state machine replication protocols, though.
The coordination consensus is (our own implementation of) disk paxos, which we liked for its operational properties in our context (the coordinators don't need to know about each other or communicate directly). An early version of fdb had a dependency on Zookeeper for this purpose; you can use anything.
Maybe not all data is replicated across all zones, so zones provide consensus but replicas provide data. Basically you have witness with enough information to provide consistency but without the full data store.
Spanner is geographically distributed and synchronizes over Google’s fiber. I’m not sure where FoundationDB falls on the spectrum, but there is “distributed” meaning runs on multiple machines, vs multiple availability zones (which could be in the same rough geographic area to minimize latency), vs multiple geographic regions, which is one of Google’s design goals for its distributed infrastructure nowadays: to be able to withstand a natural disaster or other disruption that takes a whole region offline, without missing a beat.
There’s still a lot to be said for the property, “scales horizontally to N machines,” of course, even if those machines have to be in the same data center.
It's more like what CockroachDB or TiDB use underneath. They all are suited for local clusters, but cannot perform well enough replicating across multiple far away datacenters over public internet, latency trade offs would be unbearable. Spanner is a bit different, with Google's fancy clocks and fancy networks, it can get better latency trade offs that might satisfy more applications.
Citus I can't remember, but if Citus does it properly wrt consistency, it should be in the same category as CockroachDB and TiDB, there is no magic.
See CockroachDB’s Living Without Atomic Clocks [0]. TrueTime [1] is a proprietary Google technology. Considering TrueTime upper bound uncertainty (< 10ms) [2], it may depend on Google’s network quality and breadth.
CockroachDB assumes you use NTP on your servers — but if you actually put GPS clocks into your servers, you could adjust the safety interval it assumes for clock drift down, and it'd behave pretty much the same as Google's.
I wish Apple just became a customer instead of buying and burying this for 3 years. Glad to see it come back but community momentum is hard to rebuild.
Since it has become open source and the system's architecture is in the open, I wonder how this criticism of FoundationDB from a VoltDB architect https://www.voltdb.com/blog/2015/04/01/foundationdbs-lesson-... fares against the knowledge that has now become available. (To summarize, the author argues that building an SQL layer on top of an ordered key-value store is suboptimal.)
I'd never heard of FDB, but the concept is very intriguing. We know of many NoSQL databases built on top of a KV store like RocksDB or LevelDB (like Dgraph before moving to its new storage engine), but those still need a lot more work before they can be called distributed and scalable.
By using FoundationDB we can skip many of those parts and focus on other things like the query language and API. That's why I like the layer concept; unfortunately there is very little documentation about it. I found some layers written in Python in the repo, but I don't understand where a layer sits in the general architecture. I thought it would be like a plugin, but I think that's not the case.
A "layer" uses the fdb client much as it might use rocksdb. The layer can be a library embedded in your application, or a network service, it's up to you.
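To make "a layer is just client code" concrete, here is a toy queue layer simulated over a local dict. Everything here (the class name, the local counter standing in for versionstamped keys) is hypothetical illustration; a real layer would issue the same reads and writes through the fdb client, inside `@fdb.transactional` functions, against a shared cluster:

```python
import itertools

class QueueLayer:
    """A toy queue 'layer': nothing but client code over an ordered
    key/value map. In real FDB code each method would run inside a
    transaction, and ordering would come from versionstamped keys
    rather than a local counter."""

    def __init__(self):
        self.kv = {}                  # stands in for the database
        self.seq = itertools.count()  # stands in for versionstamps

    def push(self, item):
        # Append under a monotonically increasing key.
        self.kv[("queue", next(self.seq))] = item

    def pop(self):
        # Pop the smallest key in the queue subspace (FIFO order).
        keys = sorted(k for k in self.kv if k[0] == "queue")
        if not keys:
            return None
        return self.kv.pop(keys[0])
```

Because the layer owns its key encoding, many such layers (queues, indexes, document stores) can coexist in one cluster under different subspaces.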
Wow this is super exciting. I had a half dozen things that I thought FDB would be good for and then poof! it got sucked up into the Apple spaceship. Now to have it emerge unscathed is pretty awesome.
I wonder what the SSD engine performance would look like with NVMe standard NAND or an Optane SSD instead of SATA. Any FoundationDB guys/gals on this thread able to comment?
Another Q: what's more commonly used in current FoundationDB deployments: memory engine or storage engine?
Nice documentation around the Python API. At this point, the docs state it supports 2.7-3.4. What scale of work is required to catch it up to 3.6 compatibility? https://apple.github.io/foundationdb/api-python.html
It is strange that ArangoDB isn't mentioned often on HN, or in this thread. It is a multi-model (KV, document, graph) database (with transactions) and I have been happily using it for a while now.
Haven't scaled it yet to any large installations, so I can't speak to how well it does that.
Seems like a pretty comprehensive release in terms of OS and language support.
One oddity I see is that the ruby gem is not available on rubygems.org and therefore cannot be easily installed and maintained using the ruby package manager which is a bit of a pain.
Is foundationdb capable of providing a linearizable data store? As I remember from Martin Kleppmann's book, Serializable Snapshot Isolation is not linearizable because the snapshot does not include writes more recent than itself.
Yes. Linearizable means both serializable and externally consistent (or is sometimes used as just a synonym for the latter), and FDB has these properties with respect to transactions.
That would mean that SSI is not used to provide the serializable isolation level. If so, what is used instead? Two-phase locking? I thought that wasn't very scalable.
I guess textbook SSI is willing to "reorder" conflicting transactions if the result is still serializable, which could violate external consistency if you don't have any other bounds on the order. In the language of SSI, fdb simply aborts the later of any pair of read/write transactions with an rw-conflict, in accordance with a fixed ordering which is externally consistent.
I guess it could also be that your book uses an idiosyncratic definition of linearizable, like trying to apply it to individual operations within transactions, which might rule out any optimistic concurrency method. It might just be better to delete this word from your vocabulary in the database field because there is no wide agreement on what it means. The first two hits on Google for me are Wikipedia and Peter Bailis, and they give clearly conflicting definitions, though I think fdb satisfies both!
Thanks, I'd love to have a bit more of your attention, foundationdb seems very interesting but I need to know a bit more :)
Let me expand on the definition in Kleppmann's book then. I think it is important because it creates a difference between SSI and a typical serializable level based on 2PL.
The below is paraphrasing the definitions on p. 324-329.
The book references http://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf.
(I must admit, I read the book, not the paper).
Basic idea - make a system appear as if there were only one copy of the data and ALL operations on it are atomic.
In this model, there may be replicas, but we don't care about them.
As soon as a client completes a write to the db, all clients reading the db must be able to see the value just written.
In SSI this is not true, because the snapshot may not include writes more recent than the snapshot, so reads from the snapshot are not linearizable.
Linearizable CAS register is equivalent to consensus, and can provide total order. It is therefore what most developers would love to have (if cost was not an issue :) )
"A history is serializable if it is equivalent to one in which transactions appear to execute sequentially, i.e., without interleaving... A history is strictly serializable if the transactions’ order in the sequential history is compatible with their precedence order... Linearizability can be viewed as a special case of strict serializability where transactions are restricted to consist of a single operation applied to a single object."
In these terms, FoundationDB has the strict serializability property, and thus if you do exactly one operation in each FoundationDB transaction then that is linearizable.
But that kind of linearizability is much less powerful than what FoundationDB actually gives you. You cannot efficiently maintain global invariants, like indexes, with single-operational linearizability. I don't think this definition is very useful! I think strict serializability (which is to say serializability & external consistency) is what you actually want.
A linearizable CAS register can be implemented in FDB as simply as this:
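The snippet referenced here doesn't appear in this excerpt, but one plausible shape for it, against the Python bindings' `@fdb.transactional` pattern, is shown in the comment below; the runnable function underneath swaps in a plain dict for the transaction so the compare-and-set logic can stand alone:

```python
# A plausible sketch (not the original author's exact code). With the
# real Python bindings it would look roughly like:
#
#   @fdb.transactional
#   def cas(tr, key, expected, new):
#       if tr[key] == expected:
#           tr[key] = new
#           return True
#       return False
#
# The serializable transaction makes the read-compare-write atomic, and
# strict serializability makes the register linearizable. Below, a dict
# stands in for the transaction:

def cas(store, key, expected, new):
    """Compare-and-set: write `new` only if the current value equals
    `expected`; report whether the swap happened."""
    if store.get(key) == expected:
        store[key] = new
        return True
    return False
```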
Thank you very much for your in-depth explanation. I believe the only thing left for me is to run FDB myself, sounds very promising :) FDB replacing ZooKeeper + something else would reduce the complexity of the target distributed system, almost too good to be true.
postgresql is not strictly serializable/externally consistent
for example this will commit under serializable:
create table counters(counter int);
insert into counters(counter) values(1);
BEGIN TRANSACTION ISOLATION LEVEL serializable;
select sum(counter) from counters;
/* insert sum into counters. wait until committing next transaction before executing the insert */
insert into counters(counter) values(1);
COMMIT;
/* this transaction should commit before doing the insert in the above transaction and after the above transaction has calculated the sum */
BEGIN TRANSACTION ISOLATION LEVEL serializable;
insert into counters(counter) values(10);
COMMIT;
both transactions commit and the final table looks like:
1, 10, 1
which is possible if the first transaction committed first and then the second transaction committed. But it is possible for another client to see the table as [1], [1, 10], [1, 1, 10], which is a sequence of states that should not be possible: if you see [1], [1, 10], then you should see [1, 10, 11] as the last state. Hence it violates external consistency.
Well, benmmurphy isn't talking about serializability (some correct ordering exists), but strict serializability (roughly: at least one correct ordering corresponds to wall clock order). PG does have the former, but not the latter.
Huh? The example is for PostgreSQL's "serializable" isolation level, and that's what I'm confused about.
Serializability should ensure that the outcome is equivalent to some serial execution. What serial execution of those five transactions yields [1], [1, 10], [1, 1, 10]?
But also, I just read that prior to PostgreSQL 9.1 (released in 2011), the "serializable" isolation level was actually just snapshot isolation (now called "repeatable read"). So maybe that's what benmmurphy is referring to?
This is such great news. I saw the FoundationDB guys present at the New England Database Summit a couple of years ago. Their demo reminded me of Sun Microsystems' famous demo of unscrewing their hard disk.
Is the transaction authority in-memory or on disk?
This architecture seems kinda clunky. I wonder what the performance is like, and whether this is a good fit for quick k-v store use cases?
Nope, the actor compiler works with Mono, and you need Windows to compile the Windows client. When we were independently working on it, we just ditched the Windows code completely (we are a Mac/Linux shop).
Before FDB was acquired by Apple, a lot of the engineers used Visual Studio on Windows. It's a good IDE for C++. You certainly don't need it to develop though!
Windows was never the most important deployment platform for the product. I'm not sure how performant the server is. But it should work, and I would think the client is fine.
I think the Apache Ignite distributed KV store is a much better choice, as it has 2PC, indexes, distributed computing idioms, and can be embedded as a library in a JVM app. Plus it supports a SQL engine, thanks to the H2 SQL parser.
You can also create graph layer over it using gremlin in a day or two.
This seems nice. But besides a bunch of fanboy comments coming from the creators and devs, why is this exciting to the rest of us, when things like joining tables in an RDBMS are trivial?
FDB is a lower layer. It makes it possible to build a distributed RDBMS on top of it. There is some competition in this area, but nothing remotely as mature as FDB.
Didn't downvote you, but I believe the excitement is over FoundationDB's ability to perform ACID-compliant distributed transactions without sacrificing performance, which to my knowledge no current RDBMS or even NoSQL system can do.
There are several distributed storage systems like Ceph and they all have problems. Ceph is not good because it's an object storage system trying to provide block storage and a filesystem on top, which will never work well.
Absolutely right - I wrote an object storage backed filesystem for Hadoop and synchronizing was a complete nightmare.
It certainly worked well within our required architecture - but as a general purpose system it would have several issues over typical network topologies.
As active volume size increases and/or changes, the object storage layer latency can bump up to minutes, or even hours. We stopped testing at 100TB volumes, as the object storage layer was backed up by hours.
Object Storage is very convenient, but the lack of good metadata and latencies involved basically means that it's only good as an archival backing store of active data. If your active datasource goes out - you could lose hours of data unless you have local copies.
Not that this really has anything to do with FoundationDB, but why do you say that object storage is a poor substrate for file and block abstractions? There are many high-performance block and file systems built on object storage.
Block level is the lowest form of addressing bytes on devices. Filesystems are an abstraction on top of block devices. Object stores are an abstraction on filesystems.
Emulating a low-level layer on a higher-level abstraction (which itself is using this hierarchy) will never match the speed, scale, or reliability of doing it correctly.
Yes, you can architect a storage system this way. But
1) Even if you do, many, many, many high-performance systems are "on top" of a filesystem but don't actually use the filesystem for anything except perhaps as a block allocator. Consider databases.
2) Many, many object stores do not abstract on top of filesystems. Modern RADOS, the distributed object store, stores its local data in a local object store called BlueStore. BlueStore speaks directly to the block device; there's no filesystem involved.
3) Even if you did store part of your distributed object store data on top of a local filesystem, that's not necessarily an issue. HDFS does this. (HDFS, despite the "FS" in its name, is an object store as most practitioners understand them.)
1) Yes, databases are just object stores with indexing and querying, which are an abstraction on filesystems, which are abstractions on block devices.
2) Rados is an object store, which is an abstraction on BlueStore (effectively a filesystem and replacement of FileStore), which is an abstraction on block devices.
3) HDFS is an object store, which is an abstraction on filesystems, which are an abstraction on block devices.
I'm not sure what your point is, because you just restated what I already said. They are abstractions, and they work just fine without any performance issues, because that is the trade-off of having an abstraction.
What I also said is that emulating low-level layers on a higher-level interface (like a block device on top of a database or object store) will never match the original block device. What is untrue about this?
It appears that you want to make a very simple statement: "Abstractions tend to introduce overhead. If any software layers, an OS, or a network are added on top of a block device, I/O overhead will be introduced somewhere." As a conversational seed, I would wager most people would agree with that statement, in general.
One issue in this thread is that abstractions are concepts, not cpu instructions. In order to discuss overhead, one needs to reify the abstraction. For example, if you care about latency overhead, the block scheduler will definitely introduce overhead. But if you care about throughput, you probably /want/ abstractions like queues and schedulers.
> "What I also said is that emulating low-level layers on a higher-level interface (like a block device on top of a database or object store) will never match the original block device. What is untrue about this?"
Nothing is untrue about the sentiment of your statement. But from a practical standpoint, storage devices are useless pieces of junk without software. So to say abstractions slow down storage device while ignoring their utility feels arbitrary: why not talk about the length of the SATA cable, or the firmware in the disk controller?
If the answer is that you just wanted to make the simple statement like the one I quoted at the start of this post then that's great, I think we are all in agreement. Otherwise, it's not clear what your point is and many of the supporting examples that you list are stated as fact, but are in reality either generally untrue, or very nuanced points, both of which tend to attract strong opinions :)
> Block level is the lowest form of addressing bytes on devices. Filesystems are an abstraction on top of block devices. Object stores are an abstraction on filesystems.
I don't agree with this, but I think you may be confused because "Object Storage" can mean several different things.
"Object Store" in Ceph (as in RADOS - Reliable Autonomous Distributed Object Store) basically means key-value store. I typically say "blob store" instead to avoid the confusion with more sophisticated systems. It is exposed through an S3-like API. As far as I know, this layer of Ceph is pretty good, and you need a layer like this in most distributed systems anyway.
Ceph provides something called RBD, RADOS Block Device, which exposes a Block Device interface and is implemented on top of RADOS blob storage. It is useful for VM disks and has decent performance because it makes heavy use of the cache.
Some people use filesystems on top of RBD, but as far as I know CephFS itself does not sit on top of RBD. It is not as widely used as RBD because it is pretty recent (first release in 2016). The data is stored in RADOS and the metadata (which is the hardest part in a distributed filesystem) is dealt with by a Metadata Server cluster (MDS). This sounds like a typical distributed filesystem architecture to me, similar to GFS (the MDS replaces the GFS master and RADOS is used instead of chunk servers).
People tend to have a lot of issues with Ceph, but I think this is because:
1) It is used in reasonably large-scale production settings where you are going to have issues anyway;
2) It is not as easy to understand and fine-tune as it should be;
3) Some people expect it to solve all their issues magically with perfect performance;
4) Some people use filesystems on top of RBD when they should have used CephFS, or even direct interfaces to RADOS when possible.
But in general, I think Ceph is an example of a decently architectured complex distributed system.
Just for future reference, RADOS is actually not very S3-like. It is an object store; it maps object names to buckets of bytes. But unlike S3 and many similar object stores or key-value DBs, RADOS allows you to do file-like operations: you can append, write to random offsets in the object, overwrite pieces of it without rewriting the whole object, etc. (That's all in addition to some stunningly-complex stuff like injecting custom code to do specific kinds of transactional read-writes on the OSD [storage node] itself.)
That's all key to RBD being useful, or indeed CephFS itself. There are systems that map a filesystem layer on top of S3, but they have trouble because there aren't good ways to overwrite random small pieces of an S3 object. With RADOS, there are! :)
Sure, and key/value systems are at a similar level to object stores, meaning they are abstractions on filesystems (which are abstractions on block devices). This is the hierarchy.
Using Ceph for block and file access is like using AWS S3 to emulate block devices and filesystems. It'll work, and there is software for it, but it will never be very good. And Ceph is far from S3.
As a general rule of thumb, I would definitely agree that increasing the number of abstraction layers should be done with consideration to performance concerns.
Interestingly, in the latest version of Ceph the abstractions are a bit different than you listed. Ceph is now using an object store built directly on top of raw devices. It's the file system, block, and object abstractions that exist on top of that.
I mostly agree, but if you want a distributed, reliable, scalable block store it is pretty easy to build one on FDB. Here is a super simple one that acts as an NBD (network block device) server that you can mount from Linux:
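For a flavor of how little is involved, here is a toy sketch (my own construction: a plain dict stands in for the FDB keyspace, and there is no networking, durability, or transaction logic): a block device is just a mapping from block number to a fixed-size value.

```python
BLOCK_SIZE = 4096

class KVBlockDevice:
    """Toy block device over a key-value store: key = block number."""
    def __init__(self, kv=None):
        self.kv = kv if kv is not None else {}  # stand-in for an FDB keyspace

    def read_block(self, block_no):
        # Unwritten blocks read as zeros, like a fresh disk.
        return self.kv.get(block_no, b"\x00" * BLOCK_SIZE)

    def write_block(self, block_no, data):
        assert len(data) == BLOCK_SIZE
        self.kv[block_no] = data

    def write(self, offset, data):
        # Read-modify-write each block the byte range touches; with real
        # FDB this whole loop would run inside a single transaction, which
        # is what makes the emulated device consistent under concurrency.
        while data:
            block_no, start = divmod(offset, BLOCK_SIZE)
            chunk = data[:BLOCK_SIZE - start]
            block = bytearray(self.read_block(block_no))
            block[start:start + len(chunk)] = chunk
            self.write_block(block_no, bytes(block))
            offset += len(chunk)
            data = data[len(chunk):]

dev = KVBlockDevice()
dev.write(4090, b"spans two blocks")  # a write straddling a block boundary
assert dev.read_block(0)[4090:] == b"spans "
assert dev.read_block(1)[:10] == b"two blocks"
```

An NBD server is then mostly plumbing: translate the protocol's read/write requests into these two operations.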
To the best of my knowledge (n.b.: I am not an expert, though I follow this field) there still aren't any direct competitors to the breadth of what FoundationDB can do and do well.
The speed and durability have been unmatched. We can write millions of transactions per second, with millions of reads, without issue. We have never lost data or found data to be inconsistent.
Spanner can do global consistency and (some?) transactions, but I'm unaware of it being able to do the sort of layering Foundation can do internally to expose largely different database forms on top of it. I have never used Spanner, though, so I'm open to being corrected.
What do you mean by “forms”? Spanner is also layered. The bottom layer is basically a key-value store. On top of that there’s a full blown SQL layer, which, BTW can work with hierarchical records as well as flat tables. Both support transactions and guarantee global consistency.
Apple builds their OS with a lot of software from FreeBSD, but when they opensource FoundationDB they don't provide a distribution that will work on it. I know the license says Apple doesn't have to do anything, but it just seems wrong that they didn't provide a download.
Based on your earlier comment I guess you think that Apple owes the FreeBSD community a version of FoundationDB that will work with it, because Apple uses FreeBSD technology.
I make no comment on the validity of the stance but I think you are probably in the minority.
However, since it is now open source, it is at least possible for someone to do the work to get it running on FreeBSD.
[1] Yes, yes, we ran Jepsen on it ourselves and found no problems. In fact, our everyday testing was way more brutal than Jepsen, I gave a talk about it here: https://www.youtube.com/watch?v=4fFDFbi3toc