Distributed Databases Should Work More Like CDNs (cockroachlabs.com)
212 points by loiselleatwork on Feb 7, 2018 | 83 comments



Sooo, being able to distribute data globally is good for performance? Who knew? The thing about distributed systems, including distributed databases, is that they need to navigate around the CAP theorem (Consistency, Availability, Partition tolerance: pick two, essentially), and every solution is ultimately a trade-off. This article would be a lot more interesting if it showed how CockroachDB made a better trade-off than the other solutions listed.


I think the PACELC theorem [1] should be preferred to the CAP theorem as it is more precise. It states that in case of a partition (P), there is a trade-off between availability (A) and consistency (C). Else (E), there is a trade-off between latency (L) and consistency (C).

[1] https://en.m.wikipedia.org/wiki/PACELC_theorem


Ah, much better, I have to say. Much more informative as well. The CAP theorem is a bit ungainly since the only way to have a CA database is to only have a single node.


> the only way to have a CA database is to only have a single node.

I know that this is meant to mean: in the real world you cannot just write off partition resilience and still call your system highly available, since partitions will happen sooner or later and when they do your CA system won't be available.

But on the other hand, having a system that is always available _except_ during a network partition is a useful thing: you can design a network where partitions happen much more rarely than the rate at which individual machines die.

I.e. in practice a single node, while (if you nitpick) the only true CA, will be available for less time on average than a multi-node CA system which (if you nitpick) is not CA (provided that the underlying network is partition resilient; it's not a boolean, it's a probability).

(See Google spanner)


It's my understanding that Google Spanner leverages dedicated hardware (direct lines between machines), so the degree to which network partitions occur is different from other systems.


According to a paper by Eric Brewer, the author of the original CAP theorem, Google Spanner technically is not a CA system [1]. It is advertised as such because partitions are supposed to be exceedingly rare, as the infrastructure is outstanding and a lot of operating experience is present [1].

I believe at the end of the day the CAP theorem is too fuzzy for such discussions.

[1] https://research.google.com/pubs/pub45855.html


Theorems are all well and good, but ultimately the thing that matters is the experience of using the system "IRL" as the kids say. If spanner is up 99.99% of the time, and is consistent at all times, it is probably not going to be the weak link in your software chain.

(Fwiw their slo for multi-regional instances is 99.999%, although I have no idea what their measured performance is against that objective.)


The paper says:

"Does this mean that Spanner is a CA system as defined by CAP? The short answer is “no” technically, but “yes” in effect and its users can and do assume CA.

The purist answer is “no” because partitions can happen and in fact have happened at Google, and during (some) partitions, Spanner chooses C and forfeits A. It is technically a CP system."

which I believe is another way to word what I say in my post above.


I agree. I did not mean to refute your point, but to provide more context.


The article they link at the end explains this: https://www.cockroachlabs.com/docs/stable/transactions.html


Thanks, I actually got the answer to the questions I was asking from the CDB FAQ. I just wish this article in particular was less fluffy.


Also a distributed database engineer, and while I disagree with CockroachDB's CAP Theorem tradeoff decisions, they are definitely well reasoned and principled.

RethinkDB was Master-Slave (strongly consistent) with an amazing developer community and actually survived Aphyr's tests better than most systems.

CockroachDB is also in the Master-Slave (strongly consistent) camp, but more enterprise focused, and therefore will probably not fail. Their Aphyr report passed, but was unfortunately very slow. But hey, being strongly consistent is hard, and correctness is a necessary tradeoff against performance.

Other systems, like Cassandra, us (https://github.com/amark/gun), Couch/Pouch, etc. are all in the Master-Master camp, and thus AP, not CP. Our argument is that while P holds, realtime sync with Strong Eventual Consistency is good enough yet has all the performance benefits, for everything except banking. Sure, use Rethink/Cockroach for banking (heck, better yet, Postgres!), but for fun-and-games you can do banking on top of AP systems if you put a CRDT or a blockchain on top (although a blockchain kills performance; CRDTs don't).

So yeah, I agree with you about CAP Theorem and stuff, disagree with Cockroach's particular choice - but they do have some pretty great detailed explainers/documentation on their view, and therefore should be treated seriously and not written off.


Master-master doesn't automatically imply AP. e.g. EPaxos [1] is basically a master-master CP system.

[1] https://www.cs.cmu.edu/~dga/papers/epaxos-sosp2013.pdf


Banking is eventually consistent, and it is the most incorrectly used example when it comes to database scenarios.

CRDB is also master-master, all the nodes are the same and can serve reads and writes. That has no bearing on AP or CP.


From a bit of reading, it looks like they do it with optimistic concurrency, probably at the expense of write throughput. When there is contention (writing the same keys) there may be a lot of retries.

It seems like a good solution for data that doesn't change too quickly?
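
As a rough illustration of what that looks like from the client side (a minimal sketch, assuming a psycopg2-style driver and a hypothetical `accounts` table; the SQLSTATE 40001 check is the generic "serialization failure" code that optimistic systems return on conflict):

    import psycopg2

    def transfer(conn, src, dst, amount, max_retries=5):
        # Optimistic concurrency from the client's point of view: run the
        # transaction, and if the database reports a serialization conflict
        # (SQLSTATE 40001), roll back and try again.
        for _ in range(max_retries):
            try:
                with conn.cursor() as cur:
                    cur.execute("UPDATE accounts SET balance = balance - %s WHERE id = %s",
                                (amount, src))
                    cur.execute("UPDATE accounts SET balance = balance + %s WHERE id = %s",
                                (amount, dst))
                conn.commit()
                return
            except psycopg2.Error as e:
                conn.rollback()
                if e.pgcode != "40001":  # not a retryable conflict
                    raise
        raise RuntimeError("gave up after repeated write-write conflicts")

Under heavy contention on the same keys, most of the perceived latency ends up being these retries rather than the network round trips.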


When I think of distributed databases, I think of Bitcoin - and that's not fast :-)


I love CockroachDB, but feel like this article was a bit misleading in its claims. I started out writing a comment response, but it got so long that it turned into a blog post: https://danielcompton.net/2018/02/08/add-context-cockroachdb....


This is true, and it's why Fastly is kind of like a globally distributed, nearly-cache-coherent key-value store that people use as a CDN.

There's a great talk on how it's done with a custom gossip protocol implementation: https://www.youtube.com/watch?v=HfO_6bKsy_g


This is super interesting; thanks for the pointer.


CDN: no trade offs. Faster everywhere. More reliable overall

CockroachDB: trade some performance for geographic redundancy. The trade-off may work in your favor - e.g. read-heavy workloads (or not).

I plugged in CDB in place of Postgres for some testing this week, and was surprised it worked so well.


> no trade offs

This is almost never the case, and CDNs are no exception. A CDN like Cloudflare that reuses your domain(s) means it becomes just a useless extra hop that your dynamic requests have to travel through to and from the main server. A CDN that uses its own domains requires extra DNS queries, extra TCP & SSL connection setup, etc., plus it only starts loading when the browser has started processing the HTML.


Excellent points. I definitely oversimplified.


CDNs can terminate a user's SSL connection close to the user (and keep a persistent one open to the origin), so it is more than a useless hop.

There are also HTTP headers that can instruct the browser to fetch CDN resources before the HTML is delivered.


At my previous company we built our own CDN because we were streaming video to a long tail of global users for a limited library. The problem with off-the-shelf CDNs was they couldn't keep our content cached for the long tail, but for the same price we could replicate our limited library to bare metal. Traditional caching could then go on top of that.

This was a huge win for QoS on cache misses, which were a significant portion of our traffic. There are tons of tradeoffs to make this happen which is why we couldn't find an off-the-shelf solution to deliver the same results.


Unless you build your own DNS routed CDN, it actually reduces reliability. And to make it faster everywhere you trade immediate consistency for eventual consistency.


I think what you're trying to say is that the CDN becomes another point of failure - that is, unless you have DNS failover which updates the DNS of cdn.example.com to point to your origin server's IP instead of the CDN in the event the CDN goes down. You should have monitoring & failsafes in place for your CDN, and if those are in place, a CDN has higher redundancy than just a single server, which is what the OP was trying to say, I bet.


> When partitions heal, you might have to make ugly decisions: which version of your customer’s data to you choose to discard? If two partitions received updates, it’s a lose-lose situation.

When partitions heal you simply merge all versions through conflict-free replicated data types. No ugly decisions, no sacrificing either latency or consistency. We call it strong eventual consistency [1] nowadays. And it's exactly like CDNs, except more reliable.

I'm wondering whether, since CockroachDB keeps lying about and attacking eventual consistency in these PR posts, the whole "consistency is worth sacrificing latency for" mantra might not work in practice after all. People just don't buy it; they want low latency, they want something like CDNs, something fast and reliable, something that just works. Something that CockroachDB can never deliver.

[1] https://en.wikipedia.org/wiki/Eventual_consistency#Strong_ev...


My application is just plain old CRUD.

What if two users want to change e.g. the telephone number of an existing record during a network partition.

There just is no obvious way to merge a telephone number. One of them is correct, the other is incorrect.

Can CRDTs solve my simple problem?


I've only tinkered with CRDTs, but one common theme in those applications is changing the data structure to accommodate the CRDT.

In your example, that might mean changing the single "phone number" field to many "phone numbers", so that merging the two writes results in a customer record with two phone numbers. This preserves the data, but pushes conflict resolution (which number should be used?) out into the consumers.
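
As a rough sketch of that idea (not any particular library's API), the record keeps a set of observed phone numbers and merging two replicas is just a set union, which is why there's no conflict at the storage layer:

    # Toy grow-only-set style merge for the phone-number example. Real CRDT
    # libraries also handle removals (e.g. OR-Sets); this only shows why
    # concurrent writes merge without an "ugly decision" in the database.
    def merge_customer(a, b):
        return {
            "id": a["id"],
            "phone_numbers": a["phone_numbers"] | b["phone_numbers"],  # set union
        }

    replica_1 = {"id": 42, "phone_numbers": {"+1-555-0100"}}
    replica_2 = {"id": 42, "phone_numbers": {"+1-555-0199"}}

    merged = merge_customer(replica_1, replica_2)
    # merged["phone_numbers"] == {"+1-555-0100", "+1-555-0199"}
    # Which of the two is "the" number is now an application-level question.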


I think the issue is solved (while still complex) with an event sourcing system. The chances are really high that the users did not push / update the number at the exact same time (so the events are at least 1ns apart; even when they aren't, it's not that bad). So you can actually restore a sane state by reapplying the log from scratch.

I.e. this is solved with CQRS and Event Sourcing, and it probably works in pretty much all databases. It's quite complex but pretty reliable; I'm pretty sure that everybody has already built at least an extremely simple append-only event log.
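
A minimal sketch of that pattern (hypothetical event shapes, no particular framework): every change is appended as an event, and the current state is just a replay of the log, so the disputed field can always be re-derived:

    # Append-only event log plus replay - the simplest form of event sourcing.
    events = []

    def record(event):
        events.append(event)  # never update or delete, only append

    def current_state(customer_id):
        state = {}
        # Ordering by timestamp is where the sibling comment's clock-skew
        # objection applies; a real system needs a total order per entity.
        for e in sorted(events, key=lambda e: e["ts"]):
            if e["customer_id"] == customer_id and e["type"] == "phone_changed":
                state["phone"] = e["new_phone"]
        return state

    record({"ts": 1, "customer_id": 42, "type": "phone_changed", "new_phone": "+1-555-0100"})
    record({"ts": 2, "customer_id": 42, "type": "phone_changed", "new_phone": "+1-555-0199"})
    # current_state(42) -> {"phone": "+1-555-0199"}; the earlier value stays in the log.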


1ns according to whom? Clock skew is still a thing :)


Does this imply that there's a single place that log entries get ingested at? Doesn't that make it a single point of failure?


No, you can just have partitioned, replicated topics like Kafka has; just make sure that all the events for one user get put into the same partition, so that they get totally ordered.
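
The routing being described is just hashing the user key to a partition, so all of one user's events land in a single ordered log (a sketch of the general idea, not Kafka's actual partitioner):

    import hashlib

    NUM_PARTITIONS = 12

    def partition_for(user_id: str) -> int:
        # Same key -> same partition, so one user's events share a single
        # ordered log, while different users spread across partitions.
        digest = hashlib.md5(user_id.encode()).digest()
        return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

    # partition_for("user-42") always returns the same partition number,
    # which is what gives per-user total ordering.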


So I guess the specific issue that grandparent raised was if multiple users wanted to edit the same record. So even if the queue is sharded on some primary key, then for each partition, either a) you only have one machine ingesting and thus a SPOF, or b) you ingest data using multiple machines, which means you need consensus/conflict resolution somewhere, right?


You ingest the log partition in all machines to reconstruct the state from the log. You do this even on the machine that writes to the log, treating the log as the source of truth. Since every user would belong to one partition there would be a globally agreed ordering of their events in the log.

Basically though that makes a CP tradeoff, since a network partition that hides the primary kafka replica from some of the writers causes writes to fail.

There's simply no way to have a globally distributed system that's always available and always consistent: either you get inconsistency or you drop writes.


Yep, this was the answer I was expecting. Basically, I really doubt that event sourcing magically solves this, which is what the above comment claimed [1].

[1] https://news.ycombinator.com/item?id=16328127


Well, you can use a single database for log entries, or Cassandra, or a MySQL cluster, or a PG cluster; that doesn't matter. Basically, instead of having a table where you update/delete entries, you only ever append to a table - so insert only. You can still have another table that is aggregated from your log table.

where you store (i.e. which database system, cluster whatever) your stuff doesn't matter.


That doesn't answer my question. If you're appending to the end of a table, regardless of system you still need one machine somewhere to ingest new records for the same primary key, right?


sort of

All CRDTs will give you a deterministic outcome and provide enough information to allow a user to decide if that outcome is acceptable.


Last write wins in that case?


> they want something like CDNs, something fast and reliable, something that just works.

There are other DB systems with such properties "they" can use, e.g. Cassandra. But CockroachDB also gives ACID transactions, which may be important for others.


Am I wrong, or does the article not really tell how CDB deals with the latency issue, especially with regard to writes?

If the write has to be consistent and available across multiple regions, it will need to synchronously replicate that write to all the regions, thus incurring the same performance penalty as RDS or any other consistent database.


I doubt this particular article was intended to address how CRDB handles writes cross-region, synchronous replication, etc. You'll find articles touching on varying aspects of what you're looking for if you dig through some of the earlier posts on their blog[1] or their FAQ[2].

[1]: https://www.cockroachlabs.com/blog

[2]: https://www.cockroachlabs.com/docs/stable/frequently-asked-q...


Well, the comparison section to RDS specifically claims that RDS is inferior because "this forces all writes to travel to the primary copy of your data". So it doesn't explain how CDB is superior to RDS, since writes will incur the same penalty in CDB too.


At the key-value level, CockroachDB starts off with a single, empty range (a set of sorted, contiguous data from your cluster). As you put data in, this single range eventually reaches a threshold size (64MB by default). When that happens, the data splits into two ranges, each again covering a contiguous segment of the entire key-value space. This process continues indefinitely; as new data flows in, existing ranges continue to split into new ranges, aiming to keep a relatively small and consistent range size. Each range is replicated 3-way (by default) as well, and is backed by a single Raft instance.

When your cluster spans multiple nodes (physical machines, virtual machines, or containers), newly split ranges (or more specifically, replicas of these ranges) are automatically rebalanced to nodes with more capacity. Writes addressed to a range are handled by the Raft leader for that range (which can hop around its various replicas as needed). Writes to different ranges (non-overlapping key spaces by definition) are processed independently, and very well may be processed across multiple machines.

Source: https://www.cockroachlabs.com/docs/stable/frequently-asked-q...
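
A toy model of the splitting behavior described above (illustrative only; the 64MB threshold is the documented default, everything else here is made up):

    RANGE_SPLIT_BYTES = 64 * 1024 * 1024  # default split threshold per the FAQ

    def maybe_split(keys, sizes):
        """keys: sorted keys in the range; sizes: bytes stored per key (toy model)."""
        if sum(sizes) <= RANGE_SPLIT_BYTES:
            return [(keys, sizes)]
        mid = len(keys) // 2  # split near the middle of the contiguous key span
        # Each half becomes its own contiguous range, replicated (3-way by
        # default), backed by its own Raft group, and rebalanced independently.
        return [(keys[:mid], sizes[:mid]), (keys[mid:], sizes[mid:])]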


How are ranges, with their implicit leader-region association, associated with the appropriate writers near the respective region? Seems like you must partition ranges by location for this to work.


What you described is the mechanism to deal with scalability, i.e. throughput. What is being disputed is the claim about latency. Every bit of data needs to be replicated across multiple regions consistently, and that will incur the same latency as RDS or any other consistent database.


Given the flexibility of where the range's Raft leader could be, CRDB makes an active effort to colocate it near where the requests originate from (which is part of what the CDN parallel was alluding to with low RTT for multi-region deployments).

WRT the writes here, if a majority of the replicas for that range are in the proximate regions, the requests would only travel that far before responding. I believe the argument is that this is a more flexible design point than a single point of entry for all incoming writes, regardless of the origin. The cost to write out to the furthest region within any majority of replicas is of course inevitable to have cross-region durability; alternatively you could trade this off by having the majority of the replicas specific to requests from a specific region be located in that specific region.


> The cost to write out to the furthest region within any majority of replicas is of course inevitable to have cross-region durability

So we are in agreement that CDB has the same write latency as RDS for multi-region deployments. The article seems to imply that is not the case, but as you yourself agree, it actually is.


No? Apologies if I'm not able to communicate clearly. The RTT would be to the furthest replica within the `n` replicas ordered by proximity to the query origin, where `n` is the number it takes to have a majority within the specified replication factor (so 2 for a replication factor of 3, 3 for 5, etc.). It would be different from the RTT observed with Amazon RDS as long as the furthest replica, as described above, is in a different zone/region/DC than RDS's primary instance (assuming a similar topology). So with a replication factor of 3, if your DCs are in the UK, Australia and the US, and the RDS primary was in the US, all write requests from Australia would have to travel over to the US. But with a replica in each zone, write requests from Australia would hit the DC in Australia and the DC in the UK, avoiding the larger RTT to the US.

To reiterate: I believe the argument is that this is a more flexible design point than a single point of entry for all incoming writes, regardless of the origin.
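
To put rough numbers on that (a back-of-the-envelope sketch with made-up RTT figures): with majority-quorum replication, the perceived write latency is roughly the RTT to the furthest replica inside the nearest majority, not to the furthest replica overall.

    def write_rtt(rtts_ms):
        # A write commits once a majority of replicas acknowledge it, so the
        # latency is roughly the RTT to the majority-th closest replica.
        majority = len(rtts_ms) // 2 + 1
        return sorted(rtts_ms)[majority - 1]

    # Replication factor 3, gateway colocated with one replica (illustrative RTTs):
    print(write_rtt([2, 150, 280]))  # -> 150: the 280ms replica isn't waited on
    # With a single distant primary, every write from this region would instead
    # pay the full RTT to that primary.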


Does GDPR really require all Europeans' data to stay on EU soil without explicit consent?


Seems to be a common misconception. Look at EU Model Contract Clauses: e.g. https://cloud.google.com/terms/eu-model-contract-clause


I don't think so, you need consent to process personal data anywhere in the world, and you can only transfer it outside of the EU if there are appropriate safeguards - https://gdpr-info.eu/art-46-gdpr/


I used CockroachDB for a university project, and while I think it looks very promising, I found the tooling and documentation to be a bit lacking. I wouldn't use it in production yet. However, when CockroachDB matures a bit I can really see it take off.


Imagine a globally replicated and version controlled object store. You could use this for data, assets, source code, anything you like. Would be incredibly useful for both the development side of things, as well as production.



Only if others replicate your content. At the moment it's totally voluntary. However, Filecoin may change that.


We could call it the World Wide Web.


It would fall short of all of these goals, as it would get turned into advertisement, colorful magazine pseudocontent and malware distribution platform.


Don’t we now have both things, with the latter subsidising the former?


Yes. Come to think of it, it's not that the original web died - it just didn't grow much; almost the entire growth went to the commercial part.


Yes. I've thought about a "distributed web" and mocked up some prototypes for a while. At some point, it loops back around and you come to something that is not hugely different from what we already have.

While some of the technical underpinnings could be improved, the state of the internet is determined by social and legal considerations, not technical ones.



Why not store the data only where you live? Worst case, a meteor strikes and you die with your data.


Sometimes people travel, and it would be nice to have the data if a meteor strikes while you're gone.


>Imagine a globally replicated and version controlled object store.

S3?


You can replicate it to other regions, but you can't assume it's replicated by default. You can enable replication to other regions' buckets though.


Timdb ?


what a dumb article about nothing. hey, a database replicates stuff but so does a CDN! what a brilliant insight. let's make an article with this title but fill it with marketing text.


If this interests you, you'll probably enjoy reading Google's paper on Spanner. CockroachDB was heavily influenced by it.

https://research.google.com/archive/spanner.html


As long as we're advertising, consider: https://cloud.google.com/spanner/


I'm curious, what sort of read latency is achievable with CockroachDB? Does it support some notion of tunable read consistency in order to achieve lower read latency at the expense of consistency?


Reads from CockroachDB go through a lease holder for the piece of data being read without needing confirmation from any replicas about consistency, so there is no overhead from replication. But read latency can be affected by writes on the data (conflicts), because it is fundamentally a consistent system (serializable). This is not tunable.


Thanks, that makes sense (and provides good context for the article). So for a cluster spanning multiple regions, one region can support low-latency reads for a given range, but reads in any other region will have to go cross-region to the leaseholder. Being able to move the leaseholder around to optimize read latency makes a lot of sense.

It would also be useful to me, in some cases, to be able to perform a read-only query in an "inconsistent" mode to avoid that cross-region latency, at the expense of potentially receiving stale data.


Very interesting! Have you used it at big scale for production or not yet?


The only distributed database I'm familiar with is CouchDB. Can anyone give me a birds eye view on how Cockroach is different?


Unfortunately, it's not on the list for the table of feature comparisons, but here's a start: https://www.cockroachlabs.com/docs/stable/cockroachdb-in-com...

Might be helpful if you already know the couchdb answer for each feature.



advertisement


A CDN post with no performance talk (besides the keyword).

They never mention lower performance (even on a single node). Add to that AWS VPSes with pseudo-cores and the Spectre patches, and good luck with your TPS reports.


Would you mind elaborating? Your criticism is so condensed that I'm unable to make a lot of sense of it.


Their performance is 0.03-0.1x that of Postgres. And the Spectre mitigations make your AWS VPS ~0.7x as fast as before. Meaning you can't use it for performance-sensitive stuff (the whole point of fancy sharding and (No/New)SQL).


Thanks! That's something I can make sense of.

And I agree that this is perhaps one of the many situations where people throw away the "C" of "ACID" for no reason beyond that it is modern to do so. (At least most strive for "Eventual Consistency", but that's another can of worms.)

This is even less understandable once you notice that PostgreSQL offers a lot more features than most other databases (SQL or NoSQL) and is extremely flexible and extensible - even if you use it just as a fancy JSON or XML store.



