I'm running a small Exhibitor/Zookeeper cluster dockerized (5 nodes, several hun...

saryant · on March 29, 2015

Having worked with etcd in production for the last few months, I have to agree. The CoreOS stack needs some more time to marinate.

toomuchtodo · on March 29, 2015

Thanks for this comment. Glad to know I made the right choice.

I'm not saying etcd won't ever overshadow Zookeeper; it probably will, given the momentum behind it. But as an ops guy, I wasn't willing to bet production application service discovery on it.

eropple · on March 29, 2015

My distaste for the Go community is pretty well-established in these parts; I think worse-is-better is screwing us all, and etcd seems to me to be the worse-is-better Zookeeper. And for things that don't matter, sure, worse-is-better your life away; a Rails app can be whatever you want, but the infrastructure I manage had better be bulletproof. I won't say etcd will never be competitive, but without some significant changes, I don't see it getting my vote--and those changes are largely around the parts of the feature set that etcd doesn't support, at which point...why use it, anyway?

knite · on March 30, 2015

What particular issues have you run into with etcd and/or CoreOS?

saryant · on March 30, 2015

Lots of split brains. Serious bugs making it through the alpha and beta channels into stable (and our boxes auto-updating only to become useless). Fleet units dying purely due to problems with fleetd/systemd. A particularly painful one was an Akka deployment on top of CoreOS where a sidekick unit would fail to start because fleet hadn't actually copied the unit file to the remote host. Only happened with sidekicks but due to how we ran our networking, it effectively killed the application. Almost every redeploy required manually getting fleet to copy the unit over.

Gigablah · on March 30, 2015

Just to add on: I've had fleet misreport unit status and btrfs report a lack of disk space for no apparent reason. There's also the inability to restart individual failed units that are part of a global unit.

Also there was that one time they changed how cloud-config was parsed, so if "#cloud-config" wasn't on the very first line without preceding spaces, initialisation would fail. That was when I switched the reboot strategy to manual.
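For anyone who wants to catch that before a deploy, here is a rough sketch of a pre-flight check in Python. The function name is made up for illustration, and tolerating trailing whitespace or a Windows line ending is my own assumption, not something CoreOS documents; all it verifies is that the first line is "#cloud-config" with nothing in front of it.

```python
def has_valid_cloud_config_header(text: str) -> bool:
    """Return True if the very first line is '#cloud-config'.

    Hypothetical helper, not part of any CoreOS tooling. Leading
    whitespace or blank lines before the header count as invalid;
    trailing whitespace (including '\r') is tolerated by assumption.
    """
    first_line = text.split("\n", 1)[0]
    return first_line.rstrip() == "#cloud-config"


if __name__ == "__main__":
    import sys

    # Usage: python check_header.py path/to/user-data
    with open(sys.argv[1], encoding="utf-8") as handle:
        ok = has_valid_cloud_config_header(handle.read())
    print("header ok" if ok else "header missing or not on the first line")
    sys.exit(0 if ok else 1)
```

Running it in CI against the user-data file would at least flag a stray leading space or blank line before the boxes pick it up.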

ecnahc515 · on March 30, 2015

Btrfs is no longer the default for CoreOS for this reason. Overlayfs doesn't have this issue.

saryant · on March 30, 2015

Oh man, yes. I'd blocked all my scarring memories of btrfs biting me in the ass.

eropple · on March 30, 2015

Matches up pretty well with my experience, too. I do not trust fleet as far as I can throw it.

saryant · on March 30, 2015

Yeah, the whole project was something of a disaster. Eventually things stabilized a bit but every few weeks etcd or fleetd would throw a curveball and I'd lose a day of time chasing down the problem.