
I really like the fact that the CockroachDB team recently did a detailed Jepsen test with Aphyr. The follow-up articles from both CockroachDB and Aphyr explaining the findings are very interesting to read. For those who might be interested:

https://www.cockroachlabs.com/blog/cockroachdb-beta-passes-j...

https://jepsen.io/analyses/cockroachdb-beta-20160829



> CockroachDB is a distributed, scale-out SQL database which relies on hybrid logical clocks

I was curious what "hybrid logical clocks" meant and found the linked paper a bit over my head. I found this more layman-friendly description:

http://muratbuffalo.blogspot.ca/2014/07/hybrid-logical-clock...

Apparently Google used GPS/atomic clocks to keep time synced:

>> To alleviate the problems of large ε, Google's TrueTime (TT) employs GPS/atomic clocks to achieve tight-synchronization (ε=6ms), however the cost of adding the required support infrastructure can be prohibitive and ε=6ms is still a non-negligible time.

And CockroachDB created more of a hybrid version that works on commodity hardware.
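
Since the update rules themselves are short, here is a minimal Go sketch of a hybrid logical clock along the lines of that paper. To be clear, this illustrates the algorithm only, not CockroachDB's actual implementation, and all the names are made up:

    package main

    import (
        "fmt"
        "time"
    )

    // HLC timestamps pair a physical component (the max wall time observed)
    // with a logical counter that breaks ties when wall clocks are equal or
    // lagging behind messages already seen.
    type HLC struct {
        wall    int64 // nanoseconds
        logical int32
    }

    // Now advances the clock for a local or send event.
    func (h *HLC) Now() HLC {
        pt := time.Now().UnixNano()
        if pt > h.wall {
            h.wall, h.logical = pt, 0
        } else {
            h.logical++ // wall clock hasn't advanced; fall back to the counter
        }
        return *h
    }

    // Update merges a timestamp received from another node, preserving
    // causality even when the remote wall clock is ahead of ours.
    func (h *HLC) Update(m HLC) HLC {
        pt := time.Now().UnixNano()
        switch {
        case pt > h.wall && pt > m.wall:
            h.wall, h.logical = pt, 0 // physical time dominates; reset counter
        case m.wall > h.wall:
            h.wall, h.logical = m.wall, m.logical+1 // remote clock is ahead
        case h.wall > m.wall:
            h.logical++ // our clock is ahead; just bump the counter
        default: // equal wall components: take the larger counter, then bump
            if m.logical > h.logical {
                h.logical = m.logical
            }
            h.logical++
        }
        return *h
    }

    func main() {
        var h HLC
        fmt.Println(h.Now())                      // local event
        fmt.Println(h.Update(HLC{h.wall + 5, 0})) // message from a "faster" node
    }

The appeal is that the timestamp stays close to wall time (so it's human-meaningful) while still providing the happens-before guarantees of a Lamport clock.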

Distributed systems programming sounds endlessly challenging as you are always balancing trade-offs.


You might find our post[1] on atomic clocks, or rather on having to do without them, particularly interesting.

[1]: https://www.cockroachlabs.com/blog/living-without-atomic-clo...


Hey guys, I'm a fellow developer of distributed systems here.

First of all I think what you are doing is great.

My question is: what's the point of clocks at all? The current time is a very subjective matter, as I'm sure you know; the only real time is the point at which the cluster receives the request to commit. Anything else should be considered hearsay.

Specifically, the time source of any client is totally meaningless since, as you say further down in the discussion, client machine times can be off by huge margins.

If you accept that, then one has to accept that individual machines within the cluster itself are prone to drift too, although I appreciate one can attempt to correct for that.

Wouldn't you think, though, that what matters more is an order based on the bucketed time of arrival (with respect to the cluster)?

I don't see how, given network delays, anyone can be totally sure A is prior to B, atomic clocks or not.

What is important is which request commits first.

[edit] Yes, I would love to talk privately about this topic @irfansharif


When a single system is receiving messages, you pick an observed order of events that meets some definition of fairness, and you stick with it all the way through a transaction. By pretending A happens before B (even if you're not entirely sure) you can return a self-consistent result. And once you have that you can simplify a lot of engineering and make a lot of optimizations, so that the requests aren't just reliable but also timely.

Throw three more observers in, and how do you make sure that all of them observe the requests arriving in the same order? Not even the hardware can guarantee that packets arrive at 4 places in the same order, even if the hardware is arranged in a symmetrical fashion (which takes half the fun out of a clustered solution).
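
One common way to "pick an order and stick with it" is a deterministic tie-break, e.g. sorting by (timestamp, node ID). A toy Go sketch, with made-up types, just to show the idea that any two observers applying the same rule to the same events agree on the order:

    package main

    import (
        "fmt"
        "sort"
    )

    // A toy event: observers sorting the same set of events with this rule
    // will agree on the total order, even if arrival orders differed.
    type event struct {
        ts     int64  // assigned timestamp
        nodeID string // deterministic tie-break when timestamps collide
    }

    func main() {
        evs := []event{{ts: 7, nodeID: "b"}, {ts: 7, nodeID: "a"}, {ts: 3, nodeID: "c"}}
        sort.Slice(evs, func(i, j int) bool {
            if evs[i].ts != evs[j].ts {
                return evs[i].ts < evs[j].ts
            }
            return evs[i].nodeID < evs[j].nodeID
        })
        fmt.Println(evs) // [{3 c} {7 a} {7 b}]
    }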


> My question is what's the point of clocks at all?

I would highly recommend reading the link posted by irfansharif. It's probably the best primer ever written on the subject.


Yes, I really enjoyed it!


> Specifically the time source of any client is totally meaningless since as you say further in the discussion that client machine times can be off by huge margins.

Distributed systems like CockroachDB shouldn't use the client's conception of the current time for anything at all, except possibly to store it (_verbatim_, don't interpret it) and relay it back to the client or to other clients (and let the client interpret it however they want).
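
A minimal Go sketch of that "store verbatim, don't interpret" idea; the Event type and its field names are hypothetical, purely for illustration:

    package main

    import (
        "fmt"
        "time"
    )

    // Event keeps the cluster's notion of time separate from the client's.
    type Event struct {
        ServerTime time.Time // assigned by the cluster on commit; used for ordering
        ClientTime string    // the client's claim, stored verbatim and echoed back
    }

    func main() {
        e := Event{ServerTime: time.Now(), ClientTime: "2017-04-20T12:00:00+09:00"}
        fmt.Println(e.ServerTime.UTC(), e.ClientTime) // never compare the two
    }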


Hmm, I'm not sure I completely understand your question or your source of confusion, but unless I'm grossly misunderstanding what you're stating, I think we might be conflating a couple of different subjects here. I'm happy to discuss this further over e-mail (up on my profile now) to clear up any doubts on the matter (to the best of my limited knowledge).


Why not simply have the cluster sync a time among the nodes themselves? The first node in the cluster gets the time, and as new nodes come online they set their own internal time via the cluster. So in a world without NTP or atomic clocks, the system could continue to operate.


This doesn't account for clocks on different systems running at different speeds, or for clocks jumping, especially on VMs and cloud instances, where both happen all the time.


I don't really get why you would build a distributed database with dependency on wall time (unless you're Google and can stick atomic clock HW on every node). Why not use vector clocks? Am I missing something?


The section on lock-free distributed transactions in our design document[1] should answer your question, specifically the sub-section on hybrid logical clocks.

[1]: https://github.com/cockroachdb/cockroach/blob/master/docs/de...


Thanks! Interesting. http://www.cse.buffalo.edu/tech-reports/2014-04.pdf is the relevant paper on hybrid logical clocks, linked in the FAQ.


It may be a nitpick, but Google doesn't stick atomic clocks, or even just GPS clocks, into every node, just into every data center. The difference means it's actually perfectly feasible for many other companies running DCs, or just colocated racks, to do the same. The big news was how they used the fact (that times are synchronised, with an upper limit on how far the clocks of two nodes can diverge) as a very significant optimization in Spanner, one of their distributed databases.

Building a distributed database that can optionally benefit from the same optimization actually makes a great deal of sense. Your average hobbyist won't care, but spending a few extra thousand dollars on hardware in a DC to get big throughput improvements out of your database system is a steal.


That is the real definition of distributed systems: endlessly challenging, as you are always balancing trade-offs.

The CAP theorem still holds, so we pick which 2 out of 3 to be strengths, and where to compromise as little as possible. It's a guaranteed 87.3% effective hair-loss formula. I find Quiet Riot helps.


> When a node exceeds the clock offset threshold, it will automatically shut down to prevent anomalies.

If you're planning to run on VMware, be prepared to handle rather dramatic system clock shifts; I've seen shifts of up to 5 minutes during heavy backup windows. Not all customers will be willing to have their nodes go down due to system clock / NTP issues.


CockroachDB employee here.

Yep, we've also had our share of troubles with noisy clocks in cloud environments, so that's something we're very aware of. Further down the road, we're considering a "clockless" mode, which of course isn't clockless, but depends less on the offset threshold: https://github.com/cockroachdb/cockroach/issues/14093

That said, even today, configuring a cluster with a fairly high maximum clock offset is feasible for many workloads.
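
As a rough illustration, that configuration is a startup flag; the exact flag name and default may differ by version, so verify against the docs for yours:

    # Illustrative only: raise the tolerated clock offset from the default
    # (500ms at the time of writing) before a node considers itself unhealthy.
    cockroach start --insecure --max-offset=1s --store=/mnt/data1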


Do people not run NTP on their VMs?

Or are you saying that you see heavy clock skew despite having NTP in place?


The latter. NTP only checks and corrects clock offsets every so often. If the "hardware"[1] clock undergoes offset shifts at random times because of VM pauses, this won't get fixed until the next NTP sync.

This gets exacerbated in cloud settings where VMs get moved between physical machines or racks, since now it's not just the pause; it's that the clock is now pointing at a new hardware time source. [1] In quotes since it's viewed as a single piece of hardware by the software inside the VM.


Cassandra user here, on AWS. Clock drift is a big problem on VMs: NTP is not aggressive enough in these environments to keep clocks relatively in sync, and we regularly had drifts of several hundred milliseconds between nodes. As Cassandra is extremely clock-sensitive, this is a big problem. We ended up using chrony with very aggressive settings to keep things in the sub-ms range for the most part. But it's still possible to get "hiccups" where time will skip, especially if you reboot a VM.

Vanilla ntpd makes assumptions about the hardware clock (that its drift is stable) that don't apply to virtualised clocks. Using the tsc clocksource may help as well.
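
For anyone curious, here is a rough sketch of what "very aggressive settings" can look like in a chrony.conf; the values are illustrative, not a recommendation:

    # Poll upstream servers far more often than the defaults.
    server 0.pool.ntp.org iburst minpoll 1 maxpoll 4
    server 1.pool.ntp.org iburst minpoll 1 maxpoll 4
    # Step the clock on any offset over 100ms at any time, rather than slowly
    # slewing; this is what papers over VM pauses and migrations.
    makestep 0.1 -1
    # Allow faster slewing than the conservative default (in ppm).
    maxslewrate 1000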


Interesting. I wonder if anyone has documented any best practices for timekeeping in VMs.

VMware has this but it does not appear to have been updated in a while. https://kb.vmware.com/selfservice/microsites/search.do?langu...


I run a lot of VMware:

* Set the ESXi hosts to use five external time sources

* Search for fwenable-ntpd (https://www.v-front.de/2012/01/howto-use-esxi-5-as-ntp-serve...) and download the .vib (do a security audit on it first - it's a zip file, I think - to ensure it is what you think it is). Install the .vib, which simply adds an NTP daemon option to the firewall ports. This works on v6.5

* Run ntpd on the Linux VMs, pointed at the hosts, with the local clock fudge as a fallback

* For Windows VMs in a domain, set the AD DC holding the PDC emulator role to sync its clock to the host via the VM guest tools, and leave the rest alone

* On your monitoring system, make sure it has an independent list of five sources, and use plugins like ntp-peer for the ntpds and ntp-time for Windows (Nagios/Icinga etc.)

With the above recipe, ntpq -p <host> shows offsets of less than 1 ms across the board for the ntpds after stabilising.


Hijacking my own thread:

I don't suppose anyone knows how to make a Windows NTP server permit queries? Googling does not seem to reveal anything insightful. I know how to do this for ntpd but am stuck with dealing with a Windows NTP server right now.


Why does VMware emulate the hardware clock rather than giving (possibly slightly debounced) access to the real system clock?


I suppose for vMotion purposes. A VM is not tied to a physical machine.


@CockroachDB dev:

Does CockroachDB have a health status page or REST API (like Nginx/Apache/Redis/Memcached, or a special table like MySQL)?

It would be helpful to monitor the CockroachDB database in production.

I see there is some built-in feature, but it only sends that data home to your server for analytics (it can be turned off): https://www.cockroachlabs.com/docs/diagnostics-reporting.htm...


CockroachDB has a rather nice admin interface that monitors the health of the cluster.

https://www.cockroachlabs.com/docs/explore-the-admin-ui.html

There are also a lot of RPC endpoints used by the admin UI that can be queried for more fine-grained info. However, they're primarily for internal use and might change in the future.

https://github.com/cockroachdb/cockroach/blob/master/pkg/ser...

https://github.com/cockroachdb/cockroach/blob/master/pkg/ser...


We're still working on integration with other monitoring systems, but the one we've tested the most and documented is Prometheus: https://www.cockroachlabs.com/docs/monitor-cockroachdb-with-...

Additionally, you can get some of the same status info on the dashboard using the `node status` command (https://www.cockroachlabs.com/docs/view-node-details.html).
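
For reference, a minimal prometheus.yml scrape sketch along the lines of that doc; the metrics path and port below are the documented defaults, and the node hostnames are placeholders:

    scrape_configs:
      - job_name: "cockroachdb"
        metrics_path: "/_status/vars"
        static_configs:
          - targets: ["cockroach-node1:8080", "cockroach-node2:8080"]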



